# Beam Search Error Analysis: 2 questions

Hi,

About the beam search error analysis, I have two questions:

1. The first question is one many others have asked: how would we calculate P(y*|x) through the RNN? Does Andrew mean we feed y* to the model and read off the value P(y*|x)?

2. How would beam search choose y^ when P(y*|x) > P(y^|x)? The search keeps the B highest-probability candidates at each step, so how could it make that mistake? Could it be that y^ has the highest probability in the earlier steps, but y* ends up with the higher overall probability, and we simply lost y* by pruning it in those early steps?

1. At each step, the RNN produces a probability distribution over the next word given the current word and the previous context. To calculate P(y*|x), instead of taking the argmax of this output (which is what you'd do when computing P(y^|x)), take the probability entry for the human-generated word.
2. Going back to the lecture: if the model had chosen "visits" instead of "visited", the overall translation could've had a higher probability than y^. It just so happened that the model found "visited" to be the most likely word given the words seen so far.
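A minimal sketch of the difference between the two lookups, for one decoder step (the distribution and the vocabulary index are made up for illustration):

```python
import numpy as np

# One decoder step's output: a probability distribution over a toy
# 4-word vocabulary (all values here are invented).
step_probs = np.array([0.1, 0.5, 0.3, 0.1])

# For y^ (with B=1), beam search takes the highest entry:
p_yhat_t = step_probs.max()  # 0.5

# For y*, look up the entry of the human-chosen word instead,
# e.g. a word whose (hypothetical) vocabulary index is 2:
human_word_index = 2
p_ystar_t = step_probs[human_word_index]  # 0.3
```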

I think Balaji covered it perfectly. I'd just like to add a few things from a conditional probability viewpoint.

P(\hat{y}|x) is actually P(\hat{y}^{<1>},\hat{y}^{<2>},...,\hat{y}^{<T_y>}|x) (and the same holds for P(y^*|x)).

In general, it can be written as,

\begin{aligned} &P(y^{<1>},y^{<2>},...,y^{<T_y>}|x) = \\ &P(y^{<1>}|x)\cdot P(y^{<2>}|x,y^{<1>})\cdot P(y^{<3>}|x,y^{<1>},y^{<2>})\cdots P(y^{<T_y>}|x,y^{<1>},y^{<2>},...,y^{<T_y-1>})\end{aligned}

Itâs a multiplication of a conditional probability of each step.

And the challenge is, as Balaji pointed out, that in the early phase we need to select the best candidate (or the best B candidates) based on the conditional probability of only the first few steps, like this: P(y^{<1>},y^{<2>}|x) = P(y^{<1>}|x)\cdot P(y^{<2>}|x,y^{<1>})
But you can see this is only a small part of the final sentence probability. Selecting a different path (a different combination of words) may get a higher score in the end. In this sense, Andrew said that "beam search is at fault".

To avoid this situation, we may want to keep a slightly larger B, but of course that depends on the computational resources and the time budget.
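The "lost in the early steps" failure mode can be shown with a tiny two-step example (the words echo the lecture's "visits"/"visited", but all probabilities are invented):

```python
# Step-1 probabilities for two candidate first words (made up):
p_step1 = {"visited": 0.6, "visits": 0.4}

# Probability of the best continuation given each first word (made up):
p_step2 = {"visited": 0.2, "visits": 0.9}

# Full two-step sequence probabilities:
p_seq = {w: p_step1[w] * p_step2[w] for w in p_step1}

greedy_choice = max(p_step1, key=p_step1.get)  # "visited" (0.6 > 0.4)
best_sequence = max(p_seq, key=p_seq.get)      # "visits"  (0.36 > 0.12)

# With B=1, "visits" is pruned at step 1 and can never be recovered;
# with B=2, both prefixes survive step 1 and the better sequence wins.
```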


That makes total sense. As I assumed, P(y^|x) wins in the earlier steps, but in the end P(y*|x) has the higher value. However, I'm still very unclear on how we get the value of P(y*|x).

To simplify, let's think about the B=1 case.

At step t:

• In the RNN cell, a probability distribution over the entire vocabulary is created. Then, pick the word with the highest probability. That's P(\hat{y}^{<t>}|x, \hat{y}^{<1>},...,\hat{y}^{<t-1>}).
• Then, calculate P(\hat{y}|x) = P(\hat{y}^{<1>},\hat{y}^{<2>},...,\hat{y}^{<t>}|x) by multiplying with the probability accumulated over the previous steps.
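The two bullets above can be sketched as a greedy (B=1) decoding loop; `decoder_step` below is only a stand-in for the real RNN cell:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 5
T = 3  # number of decoding steps

def decoder_step(prev_word):
    # Stand-in for the RNN cell: a real model would condition on x,
    # prev_word, and the hidden state. Here we just return a random
    # probability distribution over the vocabulary.
    logits = rng.normal(size=VOCAB_SIZE)
    return np.exp(logits) / np.exp(logits).sum()

p_yhat = 1.0
word = 0  # assumed <sos> token index
for _ in range(T):
    probs = decoder_step(word)
    word = int(np.argmax(probs))  # pick y^<t>: the highest entry
    p_yhat *= probs[word]         # accumulate the running product P(y^|x)
```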

Then, it is no longer an elegant math world but a debugging activity. The key point is that we have a probability distribution over the entire vocabulary at each step.
Now, we have y^* as a human-defined label, which is a list of words. At step t, the word to be selected can be written as y^{*<t>}. That string can be converted into its vocabulary index, which points into the probability distribution created at step t. So we can pick up P(y^{*<t>}|x,\hat{y}^{<1>},...,\hat{y}^{<t-1>}) from that distribution.

Of course, there are multiple ways to do this debugging. The important thing is that we have a list of word IDs as the human-created label. So logging the probability distribution at each step should work. Alternatively, you can pass the word ID and extract P(y^{*<t>}|x,\hat{y}^{<1>},...,\hat{y}^{<t-1>}) directly. If you want P(y^{*<t>}|x,y^{*<1>},...,y^{*<t-1>}), the way Balaji suggested is the one to use.
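For example, if you log the per-step distributions while decoding, scoring the human label is just one lookup per step (the distributions and word IDs below are dummies):

```python
import numpy as np

def score_reference(step_distributions, reference_ids):
    # Multiply the probability entry of each reference word ID,
    # giving the product of the logged per-step conditionals.
    p = 1.0
    for probs, word_id in zip(step_distributions, reference_ids):
        p *= probs[word_id]
    return p

# Two logged steps over a toy 3-word vocabulary:
dists = [np.array([0.1, 0.7, 0.2]), np.array([0.5, 0.3, 0.2])]
print(score_reference(dists, [1, 0]))  # 0.7 * 0.5 = 0.35
```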

To calculate P(y*|x), instead of taking the argmax of this output (which is what you'd do when computing P(y^|x)), take the probability entry for the human-generated word.

In any case, you need to modify your RNN code while keeping the trained weights.


Hello Everyone!

In the Error Analysis in Beam Search video, at 5:31, Prof. Andrew said that P(y^{*}|x) can be equal to or less than P(\hat{y}|x). If both are equal, it would mean the model's translation is good and neither the RNN nor beam search is at fault. I believe Prof. Andrew unintentionally used the word "equal", right?

Best,
Saif.

Please consider the problem below, after generating all candidates (a few dummy scores shown below). When we use argmax, the index of the first maximum is captured:

```python
In [1]: import numpy as np

In [2]: ps = np.array([1/5, 2/5, 2/5])

In [3]: np.argmax(ps)
Out[3]: 1
```


So, if it happens that y^{*} and \hat{y} are equally likely and the model picks \hat{y} over y^{*}, that's still unacceptable. Since P(y^{*}|x) should be the highest, the RNN is at fault.

I recommend the correction be made in case 2, using \le instead of <, like this:
But the RNN predicted P(y^{*}|x) \le P(\hat{y}|x)
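In code, the decision rule with \le in case 2 would read something like this sketch:

```python
def attribute_error(p_ystar, p_yhat):
    # Case 1: P(y*|x) > P(y^|x) -> beam search failed to find y*.
    # Case 2 (with <=): the RNN scored y* no higher than y^,
    # so the RNN is at fault -- including the tie case.
    if p_ystar > p_yhat:
        return "beam search at fault"
    return "RNN at fault"

print(attribute_error(0.4, 0.4))  # tie -> "RNN at fault"
```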

Thoughts?

Yeah, I got you. Thanks for clarifying my doubts…

And yes, I agree that \le should be used instead of < in case 2 of the Error Analysis in Beam Search video, at 5:44.

Please notify the staff via a ticket.
Thanks.

Git issue submitted…