For an n-gram model, why do we use n-1 start tokens but only 1 end token?
Hi, Mukund.
Think about what we are trying to predict: the next token. For example, in a 4-gram model, given the previous 3 tokens, what is the next token? Once the model emits the <EOS> token there is nothing left to predict, so a single <EOS> is enough; any token after it would just be another <EOS>.
In contrast, at the start of a sentence the first few words would otherwise have too little context, which is why we pad with n-1 <SOS> tokens. [<SOS>, <SOS>, 'you'] is a different context from [<SOS>, 'are', 'you'] and should produce a different distribution over the next token.
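Here's a minimal sketch of the padding in code (the names `pad_sentence` and `ngram_contexts` are just made up for illustration): pad with n-1 <SOS> tokens and a single <EOS>, then slide a window to collect (context, next token) pairs.

```python
def pad_sentence(tokens, n):
    # n-1 start tokens so even the first word gets a full-length context,
    # but only one end token, since nothing is predicted after <EOS>.
    return ["<SOS>"] * (n - 1) + tokens + ["<EOS>"]

def ngram_contexts(tokens, n):
    # Yield (context, next_token) pairs for an n-gram model.
    padded = pad_sentence(tokens, n)
    for i in range(n - 1, len(padded)):
        yield tuple(padded[i - (n - 1):i]), padded[i]

# 4-gram model: 3 <SOS> tokens, 1 <EOS> token.
for context, nxt in ngram_contexts(["how", "are", "you"], n=4):
    print(context, "->", nxt)
# ('<SOS>', '<SOS>', '<SOS>') -> how
# ('<SOS>', '<SOS>', 'how')   -> are
# ('<SOS>', 'how', 'are')     -> you
# ('how', 'are', 'you')       -> <EOS>
```

Note the last pair: <EOS> only ever appears as a prediction target, never as part of a context we condition on, which is why one is enough.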