# Beam Search Error Analysis: 2 questions

Hi,

About the beam search error analysis, I have two questions:

1. The first question is one many others have asked: how would we calculate P(y*|x) through the RNN? Does Andrew mean we feed y* to the model and read off the value P(y*|x)?

2. How would beam search choose y^ when P(y*|x) > P(y^|x)? The search keeps the B highest-probability candidates at each step, so how could it make that mistake? Could it be that y^ has the highest probability in the earlier steps, but y* ends up with the higher overall probability, and we simply lost y* by pruning it in those early steps?

1. At each step, the RNN produces a probability distribution over the next word given the current word and the previous context. To calculate P(y*|x), instead of taking the argmax of this output (which is what you'd do when computing P(y^|x)), take the probability entry for the human-generated word.
2. Going back to the lecture: if the model had chosen "visits" instead of "visited", the overall translation could've had a higher probability than y^. It just so happened that the model found "visited" to be the most likely word given the words seen so far.
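A minimal sketch of the difference between the two lookups, for one decoder step (the distribution and the vocabulary index are made up for illustration):

```python
import numpy as np

# One decoder step's output: a probability distribution over a toy
# 4-word vocabulary (all values here are invented).
step_probs = np.array([0.1, 0.5, 0.3, 0.1])

# For y^ (with B=1), beam search takes the highest entry:
p_yhat_t = step_probs.max()  # 0.5

# For y*, look up the entry of the human-chosen word instead,
# e.g. a word whose (hypothetical) vocabulary index is 2:
human_word_index = 2
p_ystar_t = step_probs[human_word_index]  # 0.3
```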

I think Balaji covered it perfectly. I'd just like to add a few things from a conditional probability viewpoint.

P(\hat{y}|x) is actually P(\hat{y}^{<1>},\hat{y}^{<2>},...,\hat{y}^{<T_y>}|x) (and the same holds for P(y^*|x)).

In general, it can be written as,

\begin{aligned} &P(y^{<1>},y^{<2>},...,y^{<T_y>}|x) = \\ &P(y^{<1>}|x)\cdot P(y^{<2>}|x,y^{<1>})\cdot P(y^{<3>}|x,y^{<1>},y^{<2>})\cdots P(y^{<T_y>}|x,y^{<1>},y^{<2>},...,y^{<T_y-1>})\end{aligned}

Itâs a multiplication of a conditional probability of each step.

And the challenge is, as Balaji pointed out, that in the early phase we need to select the best candidate (or the best B candidates) based on the conditional probability of only the first few steps, like this: P(y^{<1>},y^{<2>}|x) = P(y^{<1>}|x)\cdot P(y^{<2>}|x,y^{<1>})
But you can see this is only a small part of the final sentence probability. Selecting a different path (a different combination of words) may get a higher score in the end. In this sense, Andrew said that "beam search is at fault".

To avoid this situation, we may want to keep a slightly larger B, but of course that depends on the computational resources and the time budget.
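The "lost in the early steps" failure mode can be shown with a tiny two-step example (the words echo the lecture's "visits"/"visited", but all probabilities are invented):

```python
# Step-1 probabilities for two candidate first words (made up):
p_step1 = {"visited": 0.6, "visits": 0.4}

# Probability of the best continuation given each first word (made up):
p_step2 = {"visited": 0.2, "visits": 0.9}

# Full two-step sequence probabilities:
p_seq = {w: p_step1[w] * p_step2[w] for w in p_step1}

greedy_choice = max(p_step1, key=p_step1.get)  # "visited" (0.6 > 0.4)
best_sequence = max(p_seq, key=p_seq.get)      # "visits"  (0.36 > 0.12)

# With B=1, "visits" is pruned at step 1 and can never be recovered;
# with B=2, both prefixes survive step 1 and the better sequence wins.
```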


That makes total sense. As I assumed, P(y^|x) wins in the earlier steps, but in the end P(y*|x) has the higher value. However, I'm still very unclear on how we get the value of P(y*|x).

To simplify, let's think about the B=1 case.

At step t:

• In the RNN cell, a probability distribution over the entire vocabulary is created. Then, pick the word with the highest probability. That's P(\hat{y}^{<t>}|x, \hat{y}^{<1>},...,\hat{y}^{<t-1>}).
• Then, calculate P(\hat{y}|x) = P(\hat{y}^{<1>},\hat{y}^{<2>},...,\hat{y}^{<t>}|x) by multiplying with the probability accumulated over the previous steps.
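The two bullets above can be sketched as a greedy (B=1) decoding loop; `decoder_step` below is only a stand-in for the real RNN cell:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 5
T = 3  # number of decoding steps

def decoder_step(prev_word):
    # Stand-in for the RNN cell: a real model would condition on x,
    # prev_word, and the hidden state. Here we just return a random
    # probability distribution over the vocabulary.
    logits = rng.normal(size=VOCAB_SIZE)
    return np.exp(logits) / np.exp(logits).sum()

p_yhat = 1.0
word = 0  # assumed <sos> token index
for _ in range(T):
    probs = decoder_step(word)
    word = int(np.argmax(probs))  # pick y^<t>: the highest entry
    p_yhat *= probs[word]         # accumulate the running product P(y^|x)
```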

Then, it is no longer an elegant math world but a debugging activity. The key point is that we have a probability distribution over the entire vocabulary at each step.
Now, we have y^* as a human-defined label, which is a list of words. At step t, the word to be selected can be written as y^{*<t>}. That string can be converted into its vocabulary index, which points into the probability distribution created at step t. So we can pick up P(y^{*<t>}|x,\hat{y}^{<1>},...,\hat{y}^{<t-1>}) from that distribution.

Of course, there are multiple ways to do this debugging. The important thing is that we have a list of word IDs as the human-created label. So logging the probability distribution at each step should work. Alternatively, you can pass the word ID and extract P(y^{*<t>}|x,\hat{y}^{<1>},...,\hat{y}^{<t-1>}) directly. If you want P(y^{*<t>}|x,y^{*<1>},...,y^{*<t-1>}), the way Balaji suggested is the one to use.
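For example, if you log the per-step distributions while decoding, scoring the human label is just one lookup per step (the distributions and word IDs below are dummies):

```python
import numpy as np

def score_reference(step_distributions, reference_ids):
    # Multiply the probability entry of each reference word ID,
    # giving the product of the logged per-step conditionals.
    p = 1.0
    for probs, word_id in zip(step_distributions, reference_ids):
        p *= probs[word_id]
    return p

# Two logged steps over a toy 3-word vocabulary:
dists = [np.array([0.1, 0.7, 0.2]), np.array([0.5, 0.3, 0.2])]
print(score_reference(dists, [1, 0]))  # 0.7 * 0.5 = 0.35
```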

To calculate P(y*|x), instead of taking the argmax of this output (which is what you'd do when computing P(y^|x)), take the probability entry for the human-generated word.

In any case, you need to modify your RNN code while keeping the trained weights.


Hello Everyone!

In the Error Analysis in Beam Search video, at 5:31, Prof. Andrew said that P(y^{*}|x) can be equal to or less than P(\hat{y}|x). If both are equal, it would mean the model's translation is good and neither the RNN nor beam search is at fault. I believe Prof. Andrew unintentionally used the word "equal", right?

Best,
Saif.

Please consider the problem below, after generating all candidates (a few dummy scores shown below). When we use argmax, the index of the first maximum is captured:

```python
In [1]: import numpy as np

In [2]: ps = np.array([1/5, 2/5, 2/5])

In [3]: np.argmax(ps)
Out[3]: 1
```


So, if it happens that y^{*} and \hat{y} are equally likely and the model picks \hat{y} over y^{*}, that's still unacceptable. Since P(y^{*}|x) should be the highest, the RNN is at fault.

I recommend the correction be made in case 2, using \le instead of <, like this:
But the RNN predicted P(y^{*}|x) \le P(\hat{y}|x)
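In code, the decision rule with \le in case 2 would read something like this sketch:

```python
def attribute_error(p_ystar, p_yhat):
    # Case 1: P(y*|x) > P(y^|x) -> beam search failed to find y*.
    # Case 2 (with <=): the RNN scored y* no higher than y^,
    # so the RNN is at fault -- including the tie case.
    if p_ystar > p_yhat:
        return "beam search at fault"
    return "RNN at fault"

print(attribute_error(0.4, 0.4))  # tie -> "RNN at fault"
```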

Thoughts?

Yeah, I got you. Thanks for clarifying my doubts…

And yes, I agree that \le should be used instead of < in case 2 of the Error Analysis in Beam Search video, at 5:44.

Please notify the staff via a ticket.
Thanks.

Git issue submitted…