In the lecture during the third week, Prof. Andrew Ng mentioned that there are 10,000 to the power of 10 possible ways to construct a sentence. He explained that "Beam Search" is used instead of "Greedy Search" for this task. My understanding is that, like Greedy Search, Beam Search also predicts words sequentially. However, Beam Search considers the probability of entire sentences rather than committing to the most likely individual word at each step, as Greedy Search does. Is this the primary reason for choosing Beam Search over Greedy Search? Does it have nothing to do with the computational resources required to process 10,000 to the power of 10 possible sentence constructions?
Greedy search is likely to pick a suboptimal output since it bases its decision strictly on the most probable next word given the words seen so far. He illustrates this weakness of greedy search with an example where the better translation, "Jane is visiting Africa in September," loses out to the alternative "Jane is going to be visiting Africa in September," because "going" is a more probable word than "visiting" after "Jane is" (please see the video).
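To make the contrast concrete, here is a minimal sketch of greedy decoding. Note that `next_word_probs` is a hypothetical stand-in for a trained decoder that returns the conditional probability of each candidate next word given the prefix; it is not from the course materials.

```python
def greedy_decode(next_word_probs, max_len=10, eos="<EOS>"):
    """Greedy decoding: at each step, commit to the single most probable word."""
    sentence = []
    for _ in range(max_len):
        probs = next_word_probs(sentence)   # dict: word -> P(word | prefix)
        word = max(probs, key=probs.get)    # take the locally best word only
        if word == eos:
            break
        sentence.append(word)
    return sentence
```

Because each step keeps only one hypothesis, an early locally-best choice like "going" can never be revisited, even if it leads to a worse sentence overall.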
Assuming that the translation is around 10 words long, enumerating all 10-word sequences over a vocabulary of 10,000 words makes the search space 10,000^10 (i.e., 10^40) candidate sentences, which is far too many to evaluate exhaustively.
Beam search sits between these two extremes (greedy search and exhaustive search) and uses a hyperparameter called the beam width to help land on a good translation. When the beam width is 1, it is the same as greedy search. With a higher beam width, you keep more candidate partial sentences at each step and then score all completed candidates with a length penalty before picking the final output (see this as well).
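Here is a minimal sketch of that idea, again assuming the same hypothetical `next_word_probs(prefix)` function; it is an illustration of the technique, not the course's implementation:

```python
import math

def beam_search(next_word_probs, beam_width=3, max_len=10, alpha=0.7, eos="<EOS>"):
    """Keep the `beam_width` best partial sentences at each step,
    then pick the completed sentence with the best length-normalized score."""
    beams = [(0.0, [])]          # (log probability of sequence, word list)
    completed = []
    for _ in range(max_len):
        candidates = []
        for log_p, seq in beams:
            # Expand every kept hypothesis with every candidate next word.
            for word, p in next_word_probs(seq).items():
                cand = (log_p + math.log(p), seq + [word])
                (completed if word == eos else candidates).append(cand)
        if not candidates:
            break
        # Prune: keep only the beam_width highest-probability hypotheses.
        # With beam_width=1 this reduces to greedy search.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    completed.extend(beams)
    # Length normalization: divide log P(sentence) by len^alpha so longer
    # sentences are not unfairly penalized for having more factors < 1.
    return max(completed, key=lambda c: c[0] / max(len(c[1]), 1) ** alpha)[1]
```

Working in log probabilities avoids numerical underflow from multiplying many small probabilities, and the `alpha` exponent is the softened length-normalization hyperparameter discussed in the lecture.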
As one might expect, increasing the beam width has a direct impact on computational requirements, both in memory and processing, since more candidate sequences must be kept and expanded at each step.
Got it now, thank you!