Hi @Mubsi and @Shubham_Kumar25
If I understand correctly, Shubham is asking how including the </s> token changes things so that the probabilities of sentences of all lengths sum to one.
When I was doing the course, I had the same question. For me, actual numbers are more intuitive, so I’ll try to explain it with them.
Let’s consider all the possible 2-“word” and 3-“word” sentences that can be constructed from the letters ‘a’ and ‘b’ (a very simple language):
Here, on the left side, you can see that when we don’t have a </s> token, probabilities sum to 1 for each length, but not overall. On the right side, with the inclusion of the </s> token, the probabilities of 2-word and 3-word sentences sum to 0.38.
Note:
p(aa) = p(a|<s>) * p(a|a)             = 0.5 * 0.5            = 0.25    # without </s>
p(aa) = p(a|<s>) * p(a|a) * p(</s>|a) = 0.5 * 0.3125 * 0.375 ≈ 0.06    # with </s>
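If it helps, here is the same arithmetic as a tiny Python snippet. The probability values are just the ones quoted above from my toy example; the variable names are mine:

```python
# Without </s>: only the transitions <s> -> a and a -> a matter.
p_a_given_start = 0.5   # p(a | <s>)
p_a_given_a     = 0.5   # p(a | a)
p_aa_without_eos = p_a_given_start * p_a_given_a
print(p_aa_without_eos)            # 0.25

# With </s>: the end token takes some probability mass from every context,
# so p(a | a) drops and an extra factor p(</s> | a) appears at the end.
p_a_given_start = 0.5      # p(a | <s>)
p_a_given_a     = 0.3125   # p(a | a), re-estimated with </s> in the counts
p_eos_given_a   = 0.375    # p(</s> | a)
p_aa_with_eos = p_a_given_start * p_a_given_a * p_eos_given_a
print(round(p_aa_with_eos, 4))     # ~0.0586
```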
I don’t have a formal proof that the probabilities of sentences of all lengths (including 4, 5, …, a million, …, ∞) sum to 1, but you can get an intuitive understanding by extending the sentences to 4 words:
Now, the total probability without the </s> token is 3, while with the </s> token it is 0.46, i.e. it only increased slightly. So, intuitively, you can expect a decaying increase as you include longer and longer sentence lengths (up to infinity). The sketch below lets you play with that intuition.
I hope that helps