Beam Search Error Analysis 2 questions


About the beam search error analysis, I have two questions:

  1. The first question is what many others have asked. How would we calculate y* through the RNN? Does Andrew say we give the y* to the model and get the value p(y|x)?

  2. How would beam search chose y^ when p(y*|x) > p(y^|x)? The search chooses the B highest probabilities at each step how would it make that mistake? could it be that y^ has the highest probability in earlier steps but in the end y* has the highest and we just lost choosing y* in the earliest steps?

  1. At each step, RNN produces probability of next word given the current word and previous context. To calculate P(y*|x), instead of taking the argmax of this output (which is what you’d do when computing P(y^|x) ), take the probability entry for human generated word.
  2. Going back to the lecture, if the model had chosen visits instead of visited, then the overall translation could’ve had a higher probability than y^. It so happened that the model found visited to be the most likely word based on the words seen so far.

I suppose Balaji covers perfectly. I may want to add a few things from a conditional probability view points.

P(\hat{y}|x) is actually P(\hat{y}^{<1>},\hat{y}^{<2>}, ...,\hat{y}^{<T_y>}|x) . (and same to P(y^*|x).)

In general, it can be written as,

\begin{aligned} &P(y^{<1>},y^{<2>}, ..., y^{<T_y>}|x) = \\ &P(y^{<1>}|x)\cdot P(y^{<2>}|x,y^{<1>}\cdot )\cdot P(y^{<3>}|x,y^{<1>},y^{<2>})..\cdot P(y^{<T_y>}|x,y^{<1>},y^{<2>}, ..,y^{<T_y-1>})\end{aligned}

It’s a multiplication of a conditional probability of each step.

And, the challenge is, as Balaji pointed, in the early phase, we needs to select the best one (or B) from a conditional probability of a few steps like this. P(y^{<1>},y^{<2>}|x) = P(y^{<1>}|x)\cdot P(y^{<2>}|x,y^{<1>})
But, you can see this is a very small part of the final probability as a sentence. Selecting a different path (combination of different words) may get the higher score at the end. In this sense, Andrew said that “Beam search is at fault”.

To avoid this situation, we may want to keep slightly larger B, but, of course, it’s depending to the computational resources and expected time.

1 Like

That makes total sense. As i asumeed p(y^|x) wins in earlier steps but then at the end p(y*|x) has a higher value. However, I’m still very unclear how we get the value of p(y*|x).
I would appreciate your help.

To simplify, let think B=1 case.

At a step t,

  • In RNN cell, a probability distribution list for an entire vocabulary is created. Then, pick up the word which has the highest probability. That’s P(\hat{y}^{<t>}|x, \hat{y}^{<1>},..\hat{y}^{<t-1>})
  • Then, calculate P(\hat{y}|x) = P(\hat{y}^{<1>}\hat{y}^{<2>},..,\hat{y}^{<t>}|x) with using the cell state in the previous step.

Then, it is no longer an elegant math world, but a debug activity. The key point is we have a probability distribution list for an entire vocabulary.
Now, we have y^* as a human defined label, which is be a list of word. At the step t, the word to be selected can be written as y^{*<t>}. And, that string can be converted into the index for a vocabulary, which is in the probability distribution list created in the step t. So, we can pick up P(y^{*<t>}|x,\hat{y}^{<1>},..\hat{y}^{<t-1>}) from the probability distribution list.

Of course, there are multiple ways for this debugging. The important thing is we have a list of word ID as a label created by a human. So, logging of probability distribution in each step should work. Or pass this word ID, and extract P(y^{*<t>}|x,\hat{y}^{<1>},..\hat{y}^{<t-1>}) is the another way. If you want to see P(y^{*<t>}|x,y^{*<1>},..y^{*<t-1>}), the way Balaji suggested is the one.

To calculate P(y*|x), instead of taking the argmax of this output (which is what you’d do when computing P(y^|x) ), take the probability entry for human generated word.

In any cases, you need to modify your RNN code with keeping trained weights.

1 Like

Hello Everyone!

In the Error Analysis in Beam Search video, at 5:31, Prof. Andrew told that y^{*} can be equal to or less than \hat{y}. If both are equal, it means the model’s translation is good and neither RNN nor Beam Search is at fault. I believe Prof. Andrew unintentionally used the word “equal”. right?


Please consider the problem below, after generating all candidates (few dummy scores shown below):
When we use argmax, index of the 1st maximum is captured.

In [1]: import numpy as np

In [2]: ps = np.array([1/5, 2/5, 2/5])

In [3]: np.argmax(ps)
Out[3]: 1

So, if it happens that both y^{*} and \hat{y} were equally likely and the model picked \hat{y} over y^{*}, that’s still unacceptable. Since P(y^{*}|x) should be the highest, RNN is at fault.

I recommend the correction be done inside case 2 to make it \le instead of < like this:
But RNN predicted P(y^{*}|x) \le (\hat{y}|x)


Yeah, I got you. Thanks for clarifying my doubts…

And, yes, I agreed that ≤ should be used, instead of <, in case 2, in this Error Analysis in Beam Search video, at 5:44.

Please notify the staff via a ticket.

Git issue submitted…