We add n-1 start of sentence tokens for an n-gram model, but why do we add only one end of sentence token? Why can't we add only one start of sentence token as well?
I am not a mentor for this course, but I suspect the reason is that everything has to be padded to the same length. It would be important to be consistent in how you add the padding.
Thank you for helping me. But why do we pad with n-1 start tokens? That means the number of start tokens is not consistent: for a 3-gram we prefix 2 start of sentence tokens, while for a 5-gram we use 4 SOS tokens.
Sorry, I do not know the details. I am not a mentor for this course.
In an n-gram model we predict the next token (which is why we need only one end of sequence token).
For example, if we model a language with a 5-gram, then we want to calculate all P(w_{i}|(w_{i-1}, w_{i-2}, w_{i-3}, w_{i-4})).
So, in order to do that we only need 1 end of sequence token:
- P(<eos>|(w_4, w_3, w_2, w_1))
But, to predict the first word, we need to pad the sequence:
- P(w_1|(<sos>, <sos>, <sos>, <sos>)).
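As a minimal sketch of that padding scheme (the helper `pad_sentence` is just something I made up for illustration, not anything from the course):

```python
# Pad a tokenized sentence for an n-gram model:
# n-1 start-of-sentence tokens at the front, but a single end-of-sentence token.
def pad_sentence(tokens, n, sos="<sos>", eos="<eos>"):
    return [sos] * (n - 1) + tokens + [eos]

print(pad_sentence(["I", "love", "ngrams"], n=3))
# ['<sos>', '<sos>', 'I', 'love', 'ngrams', '<eos>']
print(pad_sentence(["I", "love", "ngrams"], n=5))
# ['<sos>', '<sos>', '<sos>', '<sos>', 'I', 'love', 'ngrams', '<eos>']
```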
An overly simple concrete example: imagine we have a corpus of just two sentences:
[I, very, love, learning, about, ngram, models]
[Some, other, very, interesting, sentence]
Then the probabilities (without smoothing) for the first word are:
- P("I"|(<sos>, <sos>, <sos>, <sos>)) = 0.5
- P("Some"|(<sos>, <sos>, <sos>, <sos>)) = 0.5
For the end of sequence probabilities to make sense we would need more examples, but here the actual values do not matter; they are only there to illustrate the point:
- P(<eos>|(learning, about, ngram, models)) = 1
- P(<eos>|(other, very, interesting, sentence)) = 1
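If it helps, here is a rough sketch that reproduces those numbers for a 5-gram model by counting padded n-grams (the function `ngram_probabilities` and its details are my own illustration, not the course's code):

```python
from collections import Counter

def ngram_probabilities(corpus, n, sos="<sos>", eos="<eos>"):
    """Maximum likelihood n-gram probabilities (no smoothing)."""
    ngram_counts = Counter()
    context_counts = Counter()
    for sentence in corpus:
        # n-1 start tokens, one end token
        padded = [sos] * (n - 1) + sentence + [eos]
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            ngram_counts[(context, padded[i])] += 1
            context_counts[context] += 1
    return {(ctx, w): c / context_counts[ctx] for (ctx, w), c in ngram_counts.items()}

corpus = [
    ["I", "very", "love", "learning", "about", "ngram", "models"],
    ["Some", "other", "very", "interesting", "sentence"],
]
probs = ngram_probabilities(corpus, n=5)
print(probs[(("<sos>",) * 4, "I")])                                # 0.5
print(probs[(("<sos>",) * 4, "Some")])                             # 0.5
print(probs[(("learning", "about", "ngram", "models"), "<eos>")])  # 1.0
```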
In other words, we do not need to model P((<eos>, <eos>)|(w_3, w_2, w_1)) (note this wouldn't even be an n-gram model) because it doesn't make sense: end of sequence means the sequence ends there, so nothing can follow it. To get the probability of the first word, on the other hand, there is no way around padding with n-1 <sos> tokens.
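Writing it out (my own rephrasing, for a 5-gram and a sentence with at least four words), the probability of a whole sentence w_1, ..., w_m contains exactly one <eos> factor, while the <sos> tokens only appear in the first n-1 contexts:

```latex
\begin{aligned}
P(w_1, \dots, w_m, \text{<eos>})
  ={}& P(w_1 \mid \text{<sos>}, \text{<sos>}, \text{<sos>}, \text{<sos>})
      \cdot P(w_2 \mid \text{<sos>}, \text{<sos>}, \text{<sos>}, w_1) \\
   & \cdot P(w_3 \mid \text{<sos>}, \text{<sos>}, w_1, w_2)
      \cdot P(w_4 \mid \text{<sos>}, w_1, w_2, w_3) \\
   & \cdot \prod_{i=5}^{m} P(w_i \mid w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1})
      \cdot P(\text{<eos>} \mid w_{m-3}, w_{m-2}, w_{m-1}, w_m)
\end{aligned}
```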
Cheers
P.S. Also check this post, which explains why adding the <eos> token helps with probabilities.