We add n-1 start of sentence tokens for an n-gram model, but why do we add only one end of sentence token? Why can't we add only one start of sentence token as well?
I am not a mentor for this course, but I suspect the reason is that everything has to be padded to the same length. It would be important to be consistent in how you add the padding.
Thank you for helping me. But why do we pad with n-1 start tokens? That means the number of start tokens is not consistent: for a 3-gram we prefix 2 start of sentence tokens, while for a 5-gram we use 4 SOS tokens.
Sorry, I do not know the details. I am not a mentor for this course.
In an n-gram model we predict the next token (which is why we need only one end of sequence token).
For example, if we model a language with a 5-gram, then we want to calculate all P(w_{i}|(w_{i-1}, w_{i-2}, w_{i-3}, w_{i-4})).
So, in order to do that we only need 1 end of sequence token:
- P(<eos>|(w_4, w_3, w_2, w_1))
But, to predict the first word, we need to pad the sequence:
- P(w_1|(<sos>, <sos>, <sos>, <sos>)).
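As a minimal sketch of that padding scheme (the helper `pad_sentence` is just something I made up for illustration, not anything from the course):

```python
# Pad a tokenized sentence for an n-gram model:
# n-1 start-of-sentence tokens at the front, but a single end-of-sentence token.
def pad_sentence(tokens, n, sos="<sos>", eos="<eos>"):
    return [sos] * (n - 1) + tokens + [eos]

print(pad_sentence(["I", "love", "ngrams"], n=3))
# ['<sos>', '<sos>', 'I', 'love', 'ngrams', '<eos>']
print(pad_sentence(["I", "love", "ngrams"], n=5))
# ['<sos>', '<sos>', '<sos>', '<sos>', 'I', 'love', 'ngrams', '<eos>']
```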
An overly simple concrete example: imagine we have a corpus of just two sentences:
[I, very, love, learning, about, ngram, models]
[Some, other, very, interesting, sentence]
Then the probabilities (without smoothing) for the first word are:
- P("I"|(<sos>, <sos>, <sos>, <sos>)) = 0.5
- P("Some"|(<sos>, <sos>, <sos>, <sos>)) = 0.5
For the end of sequence probabilities to make sense we would need more examples, but here the actual values do not matter; they are only there to illustrate the point:
- P(<eos>|(learning, about, ngram, models)) = 1
- P(<eos>|(other, very, interesting, sentence)) = 1
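If it helps, here is a rough sketch that reproduces those numbers for a 5-gram model by counting padded n-grams (the function `ngram_probabilities` and its details are my own illustration, not the course's code):

```python
from collections import Counter

def ngram_probabilities(corpus, n, sos="<sos>", eos="<eos>"):
    """Maximum likelihood n-gram probabilities (no smoothing)."""
    ngram_counts = Counter()
    context_counts = Counter()
    for sentence in corpus:
        # n-1 start tokens, one end token
        padded = [sos] * (n - 1) + sentence + [eos]
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            ngram_counts[(context, padded[i])] += 1
            context_counts[context] += 1
    return {(ctx, w): c / context_counts[ctx] for (ctx, w), c in ngram_counts.items()}

corpus = [
    ["I", "very", "love", "learning", "about", "ngram", "models"],
    ["Some", "other", "very", "interesting", "sentence"],
]
probs = ngram_probabilities(corpus, n=5)
print(probs[(("<sos>",) * 4, "I")])                                # 0.5
print(probs[(("<sos>",) * 4, "Some")])                             # 0.5
print(probs[(("learning", "about", "ngram", "models"), "<eos>")])  # 1.0
```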
In other words, we do not need to model P((<eos>, <eos>)|(w_3, w_2, w_1)) (note this wouldn't even be an n-gram model) because it doesn't make sense: end of sequence means the sequence ends there, so nothing can follow it. To get the probability of the first word, on the other hand, there is no way around padding with n-1 <sos> tokens.
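Writing it out (my own rephrasing, for a 5-gram and a sentence with at least four words), the probability of a whole sentence w_1, ..., w_m contains exactly one <eos> factor, while the <sos> tokens only appear in the first n-1 contexts:

```latex
\begin{aligned}
P(w_1, \dots, w_m, \text{<eos>})
  ={}& P(w_1 \mid \text{<sos>}, \text{<sos>}, \text{<sos>}, \text{<sos>})
      \cdot P(w_2 \mid \text{<sos>}, \text{<sos>}, \text{<sos>}, w_1) \\
   & \cdot P(w_3 \mid \text{<sos>}, \text{<sos>}, w_1, w_2)
      \cdot P(w_4 \mid \text{<sos>}, w_1, w_2, w_3) \\
   & \cdot \prod_{i=5}^{m} P(w_i \mid w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1})
      \cdot P(\text{<eos>} \mid w_{m-3}, w_{m-2}, w_{m-1}, w_m)
\end{aligned}
```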
Cheers
P.S. Also check this post, which explains why adding the <eos> token helps with probabilities.