Need clarity on trigram probability, with the help of a simple example

Corpus: I am happy, are you? Yes, I am

What is the trigram probability P(happy | I am)?

To my understanding:
C(I am happy) = 1
C(I am) = 1

I assumed C(I am) = 1 because, to be eligible for a trigram count, the sequence has to be followed by some word, doesn't it?

The “I am” at the end of the corpus is not followed by any other word, so it may not qualify for a count increment in the denominator.

Is my understanding correct?
Or should the C(I am) used in the denominator be 2?

Hi @Shaleen_Srivastava

That is precisely the reason for the special tokens (start-of-sentence <s> and end-of-sentence </s>). With them, your corpus becomes <s> <s> I am happy, are you? Yes, I am </s>.

In this case the denominator is 2, as it should be, and the numerator is 1 for both P(happy | I am) and P(</s> | I am), so each probability is 1/2.
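If it helps, here is a minimal sketch of that count in Python. It makes a few assumptions that are not part of the question: punctuation is dropped, the whole corpus is treated as one padded sequence (as in the padded corpus above), and the helper name `trigram_probability` is just something made up for illustration.

```python
from collections import Counter

def trigram_probability(tokens, context, word):
    """Maximum-likelihood estimate of P(word | context) for a two-word context.

    Hypothetical helper: pads the token list with two <s> and one </s>,
    so an "I am" at the very end is still followed by a token.
    """
    padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
    # Count every run of three consecutive tokens in the padded sequence.
    trigrams = Counter(zip(padded, padded[1:], padded[2:]))
    context = tuple(context)
    # Denominator: number of trigrams that start with the context.
    context_count = sum(n for tri, n in trigrams.items() if tri[:2] == context)
    if context_count == 0:
        return 0.0
    return trigrams[context + (word,)] / context_count

# Corpus from the question, with punctuation removed (an assumption).
tokens = "I am happy are you Yes I am".split()

p_happy = trigram_probability(tokens, ("I", "am"), "happy")
p_end = trigram_probability(tokens, ("I", "am"), "</s>")
print(p_happy, p_end)
```

Both values come out as 0.5: the context "I am" starts two trigrams, one ending in "happy" and one ending in </s>.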

