Linear Interpolation

In week 3, one of the smoothing methods taught is linear interpolation. But in the lecture it seems like we are using the trigram probability to estimate the same trigram probability. In my opinion, the trigram probability should be estimated as a linear combination of the bigram and unigram probabilities.

Please let me know your thoughts on this.

Hi @Naman_Chhibbbar

It really depends on your application. Sometimes one method leads to better results than the other, and it’s hard to guess beforehand. Tokenization is a factor here too: for example, a subword model would care less about bigram and unigram probabilities, and a character-level model would probably not care about them at all. So in the end, what matters is which method leads to better results.


Hey @arvyzukai, thank you for replying!

I think you misunderstood my question. What I am trying to ask is why we estimate the trigram probability using the trigram probability itself in the linear combination with the lower-order n-grams, as shown in the lecture.

Please let me know if I am missing something.

Hi @Naman_Chhibbbar

Ah… I think you’re mixing up back-off and interpolation.

With back-off, when a trigram is missing we use lower-order n-grams to estimate the trigram probability (instead of it being 0, hence the smoothing). I think this is the case you have in mind.

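With interpolation, we use lower-order n-grams to smooth the trigram probability: the smoothed estimate is a weighted mix of the raw (count-based) trigram, bigram, and unigram probabilities, so the trigram term in the mix is not the same quantity as the result. And as I mentioned, whether you want to do that really depends on your application. In other words, the trigram probabilities you end up with after training only matter in the context of your application, because a trigram model is “crude” anyway.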
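To make the difference concrete, here is a minimal sketch on a toy corpus. The corpus, the function names, and the weights (0.6, 0.3, 0.1) are all made up for illustration, and the back-off shown is a naive version without discounting, so treat it as a rough picture rather than what the course implements.

```python
from collections import Counter

# Toy corpus, invented for this example.
tokens = "i like green tea and i like black tea".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
total = len(tokens)


def p_uni(w):
    # Raw unigram estimate: count(w) / number of tokens.
    return unigrams[w] / total


def p_bi(w, prev):
    # Raw bigram estimate; 0.0 when the context word was never seen.
    # Simplification: sentence-boundary effects on the context count are ignored.
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0


def p_tri(w, prev2, prev1):
    # Raw trigram estimate; 0.0 when the two-word context was never seen.
    ctx = bigrams[(prev2, prev1)]
    return trigrams[(prev2, prev1, w)] / ctx if ctx else 0.0


def interpolated(w, prev2, prev1, lambdas=(0.6, 0.3, 0.1)):
    # Interpolation: always mix all three raw estimates (weights sum to 1).
    l1, l2, l3 = lambdas
    return l1 * p_tri(w, prev2, prev1) + l2 * p_bi(w, prev1) + l3 * p_uni(w)


def backoff(w, prev2, prev1):
    # Naive back-off (assumption: no discounting): use the highest-order
    # estimate that is non-zero, and fall through to the unigram otherwise.
    for p in (p_tri(w, prev2, prev1), p_bi(w, prev1)):
        if p > 0:
            return p
    return p_uni(w)


# "black tea and" never occurs as a trigram, but "tea and" occurs as a bigram.
print(p_tri("and", "black", "tea"))
print(backoff("and", "black", "tea"))
print(interpolated("and", "black", "tea"))
```

For the unseen trigram "black tea and", the raw trigram estimate is 0, back-off returns the bigram estimate 0.5, and interpolation returns roughly 0.161, which is just the weighted bigram and unigram terms. For a seen trigram such as "i like green", back-off returns the raw trigram estimate unchanged, while interpolation still mixes in the lower orders.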

But also notice that interpolation covers the back-off case too:
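Assuming the usual three-term form (the weights λ1, λ2, λ3 and the hat on the left-hand side are my notation, not necessarily the lecture's exact symbols), the interpolated estimate would be:

$$
\hat{P}(w_n \mid w_{n-2}w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2}w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
$$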

Here, if P(w_n | w_{n-2}w_{n-1}) = 0, then the interpolated trigram probability reduces to the weighted sum of the lower-order n-gram probabilities.


Got it, thanks!