Why take a random sample for prediction and not the sample with maximum probability?

In the “Sampling a Novel Sequence” lecture, our beloved Andrew explains that you take a random sample from the softmax probabilities at each time step.
Why not pick the character with the maximum probability instead of a random sample? Even if we want to sample, why can’t we take, say, the top 10% highest probabilities and then sample from them?

Hey @Narasimhan,
Welcome to the community. If we pick the character/word with the maximum probability at each time-step, then we will always sample the same sequence, since sampling a novel sequence always starts from the same inputs, i.e. a vector of zeros. I suppose this rules out selecting the most likely character/word at each time-step.

Now, the second idea you can definitely do: consider the 10% most likely characters/words, and sample from this subset according to their (renormalized) probability distribution. Whether this produces better or worse novel sequences is something you have to implement and test for yourself. When the vocabulary is large, I believe this method can still give diverse results, but when the vocabulary is small, it might restrict the diversity of the outputs.
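A minimal sketch of that subset idea (a form of top-k sampling), assuming the softmax output is available as a NumPy probability vector; the function name and the `fraction` cutoff are illustrative, not from the course:

```python
import numpy as np

def sample_from_top_fraction(probs, fraction=0.1, rng=None):
    """Sample a vocabulary index, restricted to the top `fraction` of the
    vocabulary ranked by probability; probabilities are renormalized
    over that subset before sampling."""
    rng = rng or np.random.default_rng()
    k = max(1, int(len(probs) * fraction))   # keep at least one candidate
    top = np.argsort(probs)[::-1][:k]        # indices of the k most likely tokens
    sub = probs[top] / probs[top].sum()      # renormalize over the subset
    return top[rng.choice(k, p=sub)]

probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
idx = sample_from_top_fraction(probs, fraction=0.4)  # only indices 0 or 1 can be drawn
```

With a small vocabulary like this 5-token toy example, `fraction=0.4` leaves only two candidates, which illustrates the diversity concern above.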

I hope this helps.


Thanks for the quick response!

How are we sampling the same sequence? One of the inputs is zero initially, but the other input could be any random word/character. Also, if we’re choosing the most likely prediction (by choosing the max probability), how are we missing out on the most likely character/word at each time step?
Yes, I agree that if the character/word input is the same every time, it’s the same sequence and there is no novelty in it. But isn’t that how text generation needs to work? If we input a word, then it generates the text that’s most likely to follow, rather than generating random text every time. (I said random text because we’re selecting a random next word/char.)
Why should we have diversified text generation for a given input word?
Maybe I am missing the big picture of how text generation is useful?

Hey @Narasimhan,
I suppose there is a small confusion.

The other inputs are not randomly selected; they are selected based on the probability distribution that the softmax outputs. So, we begin with the same input, i.e., zeroes, and we get the same probability distribution, from which we will select the same input for the next time-step (if we are always choosing the most likely character/word). That repeats at every time-step, so greedy selection produces an identical sequence every run.
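To see the determinism concretely, here is a toy sketch (not course code), assuming the zero input always produces the same softmax distribution: argmax yields the same token every run, while random sampling varies:

```python
import numpy as np

# Softmax output from the (fixed) zero input -- identical on every run.
probs = np.array([0.6, 0.3, 0.1])

# Greedy: always picks index 0, so the "novel" sequence starts the same way
# every time, and by induction every later step repeats too.
greedy = int(np.argmax(probs))

# Random sampling: the next token varies across draws, giving diverse sequences.
rng = np.random.default_rng()
sampled = [int(rng.choice(len(probs), p=probs)) for _ in range(5)]
```

The same argument applies recursively: once the first sampled token differs between runs, all subsequent distributions and tokens can differ as well.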

This is wrong. First of all, in “sampling a novel sequence”, we don’t input a word; we just input a vector of zeros. Secondly, our aim is to generate text that spans the probability distribution of the training set, not just to generate the most likely example from the training set.

This becomes clearer when we consider an application of this, for instance, music generation. For that application, would you like only a single music sequence from your trained model, or many different ones?

I hope this helps you.