Sampling Novel Sequences

Hello Team,

In the RNN shown in the lecture, a<0> and x<1> are initialized to zeros at the start, and the first time step predicts yhat<1>, a probability vector whose length equals the size of the dictionary. So we need to select the word with the highest probability, which becomes the next word.
The model will give bad results at the start of training but will improve eventually. Does this make sense? And how does sampling from the output distribution help, when a word selected at random could be one with a low probability of being the next word?

In this example, we use random sampling of the output in order to get a wider variety of outputs.
If you only ever take the output with the highest probability, you get the same single prediction every time, and that's not very interesting if you're trying to generate a lot of possible names.
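
To make that concrete, here is a minimal sketch (with a made-up 5-word vocabulary and hypothetical probabilities, not a trained model) contrasting greedy argmax decoding with sampling from the softmax output:

```python
import numpy as np

# Made-up 5-word vocabulary and a hypothetical softmax output y_hat
# from one RNN time step (these numbers are illustrative, not trained).
vocab = ["harry", "potter", "arora", "kane", "<EOS>"]
y_hat = np.array([0.05, 0.55, 0.15, 0.20, 0.05])

# Greedy decoding: the same distribution always yields the same word.
print("greedy:", vocab[np.argmax(y_hat)])  # always "potter"

# Sampling: words are drawn in proportion to their probability,
# so rerunning produces varied (but still plausible) outputs.
for _ in range(5):
    idx = np.random.choice(len(vocab), p=y_hat)
    print("sampled:", vocab[idx])
```

Rerunning the sampling loop gives different sequences each time; rerunning the argmax line always gives the same one.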

The point I wanted to mention was: if the first sampled word yhat<1> were "harry", then the next word should be "potter" to be the best fit. We don't want "harry arora" or "harry kane", right? Suppose there are only 3 words in the entire dictionary.

Whether "potter" is the best fit depends entirely on what your training data looks like, right? Does it contain any other last names paired with "harry"? The point is that the dictionary is not all that matters: that's just your vocabulary. The training corpus tells you how that vocabulary is combined to form sentences, and that's what is used to train the weights in the RNN cell.
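
As an illustration of "the corpus decides", here is a toy example with made-up sentences; the bigram counts it induces are a crude stand-in for the conditional distribution the RNN's weights actually learn:

```python
from collections import Counter, defaultdict

# Tiny made-up corpus: the counts here, not the dictionary, decide
# which last name "fits" after "harry".
corpus = ["harry potter", "harry potter", "harry kane",
          "harry potter", "harry arora"]

# Count how often each word follows each other word (bigram counts).
next_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for w, nxt in zip(words, words[1:]):
        next_counts[w][nxt] += 1

total = sum(next_counts["harry"].values())
for word, c in next_counts["harry"].most_common():
    print(f"P({word} | harry) = {c / total:.2f}")
# -> potter 0.60, kane 0.20, arora 0.20
```

With a different corpus, "kane" could just as easily dominate; the vocabulary is identical in both cases.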

So the solution you provided is not in the context of next-word prediction, because next-word prediction works by taking the highest probability. Is that right?

Sorry, I think we are talking at cross purposes here. Yes, this is all based on selecting high-probability next words: that's what the model is trained to do. But Tom's earlier point is that when you use the trained model to do the sampling, another level of randomness is introduced by drawing the next word at random according to the output probabilities, so that high-probability words are the most likely to be chosen but we don't get the same result if we rerun with the same inputs. The model doesn't just always select the single highest-probability next word (or next letter, in the case of the dinosaur names).
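
For reference, here is a sketch of what that sampling loop looks like, in the spirit of the dinosaur-names exercise. The weights (Wax, Waa, Wya, b, by, following the course's notation) are random stand-ins here, not a trained model; in the actual assignment they come from training on the name corpus:

```python
import numpy as np

# Character-level sampling loop sketch. The weights are random
# stand-ins, NOT a trained model.
vocab = list("abcdefghijklmnopqrstuvwxyz") + ["\n"]  # "\n" plays the role of <EOS>
V, H = len(vocab), 50
rng = np.random.default_rng(0)
Wax = rng.standard_normal((H, V)) * 0.01
Waa = rng.standard_normal((H, H)) * 0.01
Wya = rng.standard_normal((V, H)) * 0.01
b, by = np.zeros((H, 1)), np.zeros((V, 1))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

a = np.zeros((H, 1))   # a<0> = zeros, as in the lecture
x = np.zeros((V, 1))   # x<1> = zeros, as in the lecture
chars = []
for _ in range(30):                             # cap the name length
    a = np.tanh(Wax @ x + Waa @ a + b)          # hidden state update
    y_hat = softmax(Wya @ a + by)               # distribution over characters
    idx = np.random.choice(V, p=y_hat.ravel())  # sample, don't argmax
    if vocab[idx] == "\n":                      # sampled end-of-name
        break
    chars.append(vocab[idx])
    x = np.zeros((V, 1))
    x[idx] = 1                                  # sampled char is the next input
print("".join(chars))
```

The key line is the `np.random.choice(..., p=...)` call: it samples according to the model's probabilities rather than always taking the argmax, which is exactly why rerunning generates different names.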

Got it. Thanks @paulinpaloalto