In the first step of Beam Search, how is the set of first words chosen, since the first word has no previously generated words on which to base its conditional probability? Does the algorithm check for the top N most frequently occurring words, where N is the beam width? I don't think this is touched upon in the course.
In the French to English translation example, "Jane is visiting Africa in September", how does Beam Search know it has to pick [in, jane, september] for the first 3 words, where 3 is the beam width?
The beam search algorithm selects the first set of words based on the conditional probability of each word in the vocabulary given the input sentence. It uses the context vector produced by the encoder to predict the most likely first words. Instead of selecting the most frequent words, it evaluates the probability of each possible word being the first word and selects the top N (where N is the beam width) words with the highest probability.
So, Beam Search chooses words like [in, Jane, September] based on the neural network’s prediction of the most likely first word(s) given the input. The probability of each word is derived from the encoder-decoder model’s predictions, and Beam Search keeps track of the top N words and incrementally builds on them for subsequent words.
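To make that concrete, here is a rough sketch of the first-step selection. The function name, the tiny vocabulary, and the probability numbers are all made up for illustration; the only point is that the candidates are chosen by predicted probability, not by frequency:

```python
import numpy as np

def first_step_candidates(first_step_probs, vocab, beam_width=3):
    """Keep the beam_width most probable first words (not the most frequent ones)."""
    top_idx = np.argsort(first_step_probs)[-beam_width:][::-1]
    # Store log-probabilities so later steps can add scores instead of multiplying.
    return [([vocab[i]], float(np.log(first_step_probs[i]))) for i in top_idx]

# Toy distribution over a tiny vocabulary (numbers are invented for illustration).
vocab = ["in", "jane", "september", "africa", "visits", "is"]
probs = np.array([0.30, 0.25, 0.20, 0.10, 0.10, 0.05])
print(first_step_candidates(probs, vocab))
# -> [(['in'], -1.20...), (['jane'], -1.38...), (['september'], -1.60...)]
```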
The beam search algorithm selects the first set of words based on the conditional probability of each word in the vocabulary given the input sentence. It uses the context vector produced by the encoder to predict the most likely first words.
I needed a clearer understanding of this. After some research, here is my understanding:
- The encoder converts the input into embeddings, which are passed as a context vector to the decoder.
- The decoder is initialized, typically with the final hidden states of the encoder. The initial token (often <s> or a similar start token) is used as the first input to the decoder.
- The decoder outputs logits for the first position from the start token; these logits are converted (via softmax) into probabilities over all the words in the English vocabulary.
- The top N most probable words are chosen from this distribution, where N is the beam width (see the sketch after this list).
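Here is a rough sketch of those steps in code. The decoder_step function is just a hypothetical stand-in for a trained decoder (it returns random logits), so only the shape of the computation matters:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()           # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical single decoder step: takes the previous token id and the
# encoder's context, returns logits over the target vocabulary.
def decoder_step(prev_token_id, context_vector, vocab_size=10000):
    rng = np.random.default_rng(prev_token_id)   # stand-in for a trained network
    return rng.normal(size=vocab_size)

START_TOKEN_ID = 0
context = np.zeros(256)                            # placeholder encoder state
logits = decoder_step(START_TOKEN_ID, context)     # logits for the first position
probs = softmax(logits)                            # logits -> probabilities
beam_width = 3
top_n_ids = np.argsort(probs)[-beam_width:][::-1]  # top-N first-word candidates
print(top_n_ids, probs[top_n_ids])
```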
The decoder is trained using "teacher forcing", where the correct target word is provided as the input at each step, as opposed to the word the model predicted. This helps in better learning; thus, given an encoded context vector, the decoder knows (at least in theory) exactly what word to predict next in the sequence.
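For what it's worth, a minimal sketch of what teacher forcing looks like during a training step (decoder_step and loss_fn are hypothetical placeholders, not the course's code):

```python
def train_step_teacher_forcing(decoder_step, loss_fn, context, target_ids, start_id):
    """At every step, feed the *correct* previous target word, not the model's guess."""
    total_loss = 0.0
    prev_id = start_id
    for gold_id in target_ids:
        logits = decoder_step(prev_id, context)   # predict the next word
        total_loss += loss_fn(logits, gold_id)    # compare with the ground-truth word
        prev_id = gold_id                         # teacher forcing: feed the gold word next
    return total_loss / len(target_ids)
```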
I hope my understanding is correct
Your understanding is correct! To complement that, using teacher forcing during training helps the model learn the correct sequence. This means that during inference, the model is better at predicting the next word in the sequence when given an encoded context. However, during inference, the predicted word is used as input for the next step instead of the correct target (because you don't have the ground truth). This can lead to errors, which is why beam search is used to explore multiple possibilities instead of just one.
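To illustrate the difference, here is a minimal sketch of free-running (greedy) inference, where the model's own prediction is fed back in; decoder_step is the same kind of hypothetical stand-in returning logits over the vocabulary:

```python
import numpy as np

def greedy_decode(decoder_step, context, start_id, end_id, max_len=20):
    """Free-running inference: each predicted word becomes the next input,
    so an early mistake propagates -- the problem beam search mitigates."""
    output, prev_id = [], start_id
    for _ in range(max_len):
        logits = decoder_step(prev_id, context)
        prev_id = int(np.argmax(logits))      # the model's own guess is fed back in
        if prev_id == end_id:
            break
        output.append(prev_id)
    return output
```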
After selecting the top-N words in the first step, each is used as a new starting point to predict the next word. The decoder will consider the context (the hidden states) along with the first word to generate a probability distribution for the second word, and so on. Each partial sequence is scored by its cumulative probability. The search continues until the desired sequence length or an end token is reached.
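A minimal sketch of one such expansion step, assuming the same kind of hypothetical decoder_step and beams stored as (token_ids, cumulative log-probability) pairs:

```python
import numpy as np

def beam_search_step(beams, decoder_step, context, beam_width):
    """Expand every partial hypothesis by one word and keep the best beam_width."""
    candidates = []
    for tokens, score in beams:
        logits = decoder_step(tokens[-1], context)
        z = logits - logits.max()
        log_probs = z - np.log(np.exp(z).sum())              # numerically stable log-softmax
        for word_id in np.argsort(log_probs)[-beam_width:]:  # cheap per-beam pre-pruning
            candidates.append((tokens + [int(word_id)],
                               score + float(log_probs[word_id])))
    # Keep only the beam_width highest cumulative scores across all expansions.
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_width]
```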
In the French-to-English example, after selecting the top three words for the first position (“in,” “Jane,” “September”), beam search would predict the next word for each candidate by considering the context of both the input sentence and the chosen first word. As the sequence progresses, it keeps track of the most probable translations, discarding unlikely candidates.