# Beam Search Error Analysis: 2 questions

Hi,

About the beam search error analysis, I have two questions:

1. The first question is one many others have asked: how would we calculate P(y*|x) through the RNN? Does Andrew mean we feed y* to the model and read off the value P(y*|x)?

2. How would beam search choose y^ when P(y*|x) > P(y^|x)? The search keeps the B highest-probability candidates at each step, so how could it make that mistake? Could it be that y^ has the highest probability in the earlier steps, but y* ends up with the higher overall probability, and we simply lost y* by pruning it in those early steps?

1. At each step, the RNN produces a probability distribution over the next word given the current word and the previous context. To calculate P(y*|x), instead of taking the argmax of this output (which is what you'd do when computing P(y^|x)), take the probability entry for the human-generated word.
2. Going back to the lecture: if the model had chosen "visits" instead of "visited", the overall translation could've had a higher probability than y^. It just so happened that the model found "visited" to be the most likely word given the words seen so far.
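A minimal sketch of the difference between the two lookups, for one decoder step (the distribution and the vocabulary index are made up for illustration):

```python
import numpy as np

# One decoder step's output: a probability distribution over a toy
# 4-word vocabulary (all values here are invented).
step_probs = np.array([0.1, 0.5, 0.3, 0.1])

# For y^ (with B=1), beam search takes the highest entry:
p_yhat_t = step_probs.max()  # 0.5

# For y*, look up the entry of the human-chosen word instead,
# e.g. a word whose (hypothetical) vocabulary index is 2:
human_word_index = 2
p_ystar_t = step_probs[human_word_index]  # 0.3
```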

I think Balaji covered it perfectly. I'd just like to add a few things from a conditional probability viewpoint.

P(\hat{y}|x) is actually P(\hat{y}^{<1>},\hat{y}^{<2>},...,\hat{y}^{<T_y>}|x) (and the same holds for P(y^*|x)).

In general, it can be written as,

\begin{aligned} &P(y^{<1>},y^{<2>},...,y^{<T_y>}|x) = \\ &P(y^{<1>}|x)\cdot P(y^{<2>}|x,y^{<1>})\cdot P(y^{<3>}|x,y^{<1>},y^{<2>})\cdots P(y^{<T_y>}|x,y^{<1>},y^{<2>},...,y^{<T_y-1>})\end{aligned}

Itâs a multiplication of a conditional probability of each step.

And the challenge is, as Balaji pointed out, that in the early phase we need to select the best candidate (or the best B candidates) based on the conditional probability of only the first few steps, like this: P(y^{<1>},y^{<2>}|x) = P(y^{<1>}|x)\cdot P(y^{<2>}|x,y^{<1>})
But you can see this is only a small part of the final sentence probability. Selecting a different path (a different combination of words) may get a higher score in the end. In this sense, Andrew said that "beam search is at fault".

To avoid this situation, we may want to keep a slightly larger B, but of course that depends on the computational resources and the time budget.
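The "lost in the early steps" failure mode can be shown with a tiny two-step example (the words echo the lecture's "visits"/"visited", but all probabilities are invented):

```python
# Step-1 probabilities for two candidate first words (made up):
p_step1 = {"visited": 0.6, "visits": 0.4}

# Probability of the best continuation given each first word (made up):
p_step2 = {"visited": 0.2, "visits": 0.9}

# Full two-step sequence probabilities:
p_seq = {w: p_step1[w] * p_step2[w] for w in p_step1}

greedy_choice = max(p_step1, key=p_step1.get)  # "visited" (0.6 > 0.4)
best_sequence = max(p_seq, key=p_seq.get)      # "visits"  (0.36 > 0.12)

# With B=1, "visits" is pruned at step 1 and can never be recovered;
# with B=2, both prefixes survive step 1 and the better sequence wins.
```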


That makes total sense. As I assumed, P(y^|x) wins in the earlier steps, but in the end P(y*|x) has the higher value. However, I'm still very unclear on how we get the value of P(y*|x).

To simplify, let's think about the B=1 case.

At step t:

• In the RNN cell, a probability distribution over the entire vocabulary is created. Then, pick the word with the highest probability. That's P(\hat{y}^{<t>}|x, \hat{y}^{<1>},...,\hat{y}^{<t-1>}).
• Then, calculate P(\hat{y}|x) = P(\hat{y}^{<1>},\hat{y}^{<2>},...,\hat{y}^{<t>}|x) by multiplying with the probability accumulated over the previous steps.
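The two bullets above can be sketched as a greedy (B=1) decoding loop; `decoder_step` below is only a stand-in for the real RNN cell:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 5
T = 3  # number of decoding steps

def decoder_step(prev_word):
    # Stand-in for the RNN cell: a real model would condition on x,
    # prev_word, and the hidden state. Here we just return a random
    # probability distribution over the vocabulary.
    logits = rng.normal(size=VOCAB_SIZE)
    return np.exp(logits) / np.exp(logits).sum()

p_yhat = 1.0
word = 0  # assumed <sos> token index
for _ in range(T):
    probs = decoder_step(word)
    word = int(np.argmax(probs))  # pick y^<t>: the highest entry
    p_yhat *= probs[word]         # accumulate the running product P(y^|x)
```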

Then, it is no longer an elegant math world but a debugging activity. The key point is that we have a probability distribution over the entire vocabulary at each step.
Now, we have y^* as a human-defined label, which is a list of words. At step t, the word to be selected can be written as y^{*<t>}. That string can be converted into its vocabulary index, which points into the probability distribution created at step t. So we can pick up P(y^{*<t>}|x,\hat{y}^{<1>},...,\hat{y}^{<t-1>}) from that distribution.

Of course, there are multiple ways to do this debugging. The important thing is that we have a list of word IDs as the human-created label. So logging the probability distribution at each step should work. Alternatively, you can pass the word ID and extract P(y^{*<t>}|x,\hat{y}^{<1>},...,\hat{y}^{<t-1>}) directly. If you want P(y^{*<t>}|x,y^{*<1>},...,y^{*<t-1>}), the way Balaji suggested is the one to use.
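For example, if you log the per-step distributions while decoding, scoring the human label is just one lookup per step (the distributions and word IDs below are dummies):

```python
import numpy as np

def score_reference(step_distributions, reference_ids):
    # Multiply the probability entry of each reference word ID,
    # giving the product of the logged per-step conditionals.
    p = 1.0
    for probs, word_id in zip(step_distributions, reference_ids):
        p *= probs[word_id]
    return p

# Two logged steps over a toy 3-word vocabulary:
dists = [np.array([0.1, 0.7, 0.2]), np.array([0.5, 0.3, 0.2])]
print(score_reference(dists, [1, 0]))  # 0.7 * 0.5 = 0.35
```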

To calculate P(y*|x), instead of taking the argmax of this output (which is what you'd do when computing P(y^|x)), take the probability entry for the human-generated word.

In any case, you need to modify your RNN code while keeping the trained weights.


Hello Everyone!

In the Error Analysis in Beam Search video, at 5:31, Prof. Andrew said that P(y^{*}|x) can be equal to or less than P(\hat{y}|x). If both are equal, it would mean the model's translation is good and neither the RNN nor beam search is at fault. I believe Prof. Andrew unintentionally used the word "equal", right?

Best,
Saif.

Please consider the problem below, after generating all candidates (a few dummy scores shown below). When we use argmax, the index of the first maximum is captured:

```python
In [1]: import numpy as np

In [2]: ps = np.array([1/5, 2/5, 2/5])

In [3]: np.argmax(ps)
Out[3]: 1
```


So, if it happens that y^{*} and \hat{y} are equally likely and the model picks \hat{y} over y^{*}, that's still unacceptable. Since P(y^{*}|x) should be the highest, the RNN is at fault.

I recommend the correction be made in case 2, using \le instead of <, like this:
But the RNN predicted P(y^{*}|x) \le P(\hat{y}|x)
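In code, the decision rule with \le in case 2 would read something like this sketch:

```python
def attribute_error(p_ystar, p_yhat):
    # Case 1: P(y*|x) > P(y^|x) -> beam search failed to find y*.
    # Case 2 (with <=): the RNN scored y* no higher than y^,
    # so the RNN is at fault -- including the tie case.
    if p_ystar > p_yhat:
        return "beam search at fault"
    return "RNN at fault"

print(attribute_error(0.4, 0.4))  # tie -> "RNN at fault"
```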

Thoughts?

Yeah, I got you. Thanks for clarifying my doubts…

And yes, I agree that \le should be used instead of < in case 2 of the Error Analysis in Beam Search video, at 5:44.

Please notify the staff via a ticket.
Thanks.

Git issue submitted…