In one of the refinements mentioned by Prof. Ng, we divide the objective function (the product of the probabilities) by the length of the sequence. This is done because multiplying more probabilities (numbers less than 1) lowers the score of longer sentences relative to shorter ones, which can make the model favour shorter translations. But won't this problem get worse when we divide by some power (\alpha) of the length? A shorter sequence has a smaller length than a longer one, and the product of probabilities is already smaller for the longer sequence, so dividing by length seems to widen the gap between the two (the shorter sequence's value goes up since we divide by a smaller number, while the longer sequence's value goes down). So, will length normalization really help refine beam search?
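For reference, the length-normalised objective from the lectures (written in the course's notation, with T_y the output length) is, as I understand it:

```latex
\frac{1}{T_y^{\alpha}} \sum_{t=1}^{T_y} \log P\!\left(y^{\langle t \rangle} \mid x, y^{\langle 1 \rangle}, \ldots, y^{\langle t-1 \rangle}\right)
```

Note that the division is applied to the *sum of log probabilities*, not to the raw product of probabilities, which turns out to matter for the question above.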


This threw me as well at first (which is why I searched for an answer).

The reason this works is that log probabilities are all negative, since \log(x) < 0 for any 0 < x < 1.

So, for example, suppose that without normalisation I have two sentences to choose from: a 1-word sentence with a log probability of -4, and a 5-word sentence with a log probability of -10. My algorithm would pick the 1-word sentence.

With normalisation (ignoring the \alpha factor introduced in the lectures), the normalised log probability of the 1-word sentence would still be -4, but the 5-word sentence would now score -\frac{10}{5} = -2, which is *bigger* than -4, so beam search would pick the *longer* sentence in this case.
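A tiny Python sketch of this comparison (the scores are the assumed values from the example above, not real model outputs):

```python
# Two hypothetical candidate translations with their total log probabilities.
candidates = {
    "short": {"logp": -4.0,  "length": 1},  # 1-word sentence
    "long":  {"logp": -10.0, "length": 5},  # 5-word sentence
}

# Without normalisation: compare total log probabilities directly.
best_raw = max(candidates, key=lambda s: candidates[s]["logp"])

# With length normalisation (alpha = 1): divide by sentence length.
# Dividing a *negative* number by a larger length pulls it toward zero,
# so the longer sentence is helped, not hurt.
best_norm = max(candidates,
                key=lambda s: candidates[s]["logp"] / candidates[s]["length"])

# With the softened exponent alpha = 0.7 mentioned in the lectures:
alpha = 0.7
best_soft = max(candidates,
                key=lambda s: candidates[s]["logp"] / candidates[s]["length"] ** alpha)

print(best_raw, best_norm, best_soft)  # the short sentence only wins without normalisation
```

Here the 5-word sentence scores -10/5 = -2 after normalisation, beating the 1-word sentence's -4, exactly as in the example.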
