Your understanding is correct! To complement that: teacher forcing during training helps the model learn the correct sequence, so at inference time it is better at predicting the next word when given the encoded context. However, during inference the predicted word is fed back as the input for the next step instead of the correct target (because you don't have the ground truth), so an early mistake can compound over the rest of the sequence. That is why beam search is used to explore multiple possibilities instead of just one.
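Here is a minimal sketch of that difference, assuming a made-up toy vocabulary and a stand-in `decoder_step` function (a real model's decoder would replace it). During training the ground-truth previous word is fed in at every step; during inference the model's own prediction is fed back.

```python
import numpy as np

# Toy stand-ins for illustration only: a tiny vocabulary and a hypothetical
# one-step decoder that returns a probability distribution over the next word
# given the encoded context and the previous word.
VOCAB = ["<s>", "</s>", "Jane", "visits", "Africa", "in", "September"]

def decoder_step(context, prev_word, rng):
    """Hypothetical decoder step: here just a random softmax for illustration."""
    logits = rng.normal(size=len(VOCAB))
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs

def train_step_teacher_forcing(context, target_words, rng):
    """Training: feed the ground-truth previous word at every step."""
    losses = []
    prev = "<s>"
    for gold in target_words:
        probs = decoder_step(context, prev, rng)
        losses.append(-np.log(probs[VOCAB.index(gold)]))  # cross-entropy term
        prev = gold  # teacher forcing: the true word becomes the next input
    return float(np.mean(losses))

def greedy_decode(context, max_len, rng):
    """Inference: feed back the model's own prediction (no ground truth)."""
    prev, output = "<s>", []
    for _ in range(max_len):
        probs = decoder_step(context, prev, rng)
        prev = VOCAB[int(np.argmax(probs))]  # the model's guess becomes the input
        if prev == "</s>":
            break
        output.append(prev)
    return output
```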
After selecting the top-N words in the first step, each one is used as a new starting point to predict the next word. The decoder considers the context (the encoder's hidden states) together with that first word to generate a probability distribution for the second word, and so on. Each partial sequence is scored by its cumulative probability (in practice, the sum of log probabilities). The search continues until the desired sequence length is reached or an end token is generated.
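Reusing the hypothetical `decoder_step` and `VOCAB` from the sketch above, this is roughly what that loop looks like: every surviving partial sequence is expanded with every possible next word, scored by its cumulative log probability, and only the `beam_width` best are kept.

```python
def beam_search(context, beam_width, max_len, rng):
    """Keep the beam_width highest-scoring partial sequences at every step."""
    # Each beam entry is (cumulative log probability, list of words, finished?)
    beams = [(0.0, ["<s>"], False)]
    for _ in range(max_len):
        candidates = []
        for score, words, done in beams:
            if done:  # finished sequences pass through unchanged
                candidates.append((score, words, True))
                continue
            probs = decoder_step(context, words[-1], rng)
            for i, p in enumerate(probs):  # expand with every possible next word
                candidates.append((score + np.log(p),
                                   words + [VOCAB[i]],
                                   VOCAB[i] == "</s>"))
        # Prune: keep only the beam_width best partial sequences
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
        if all(done for _, _, done in beams):
            break
    return beams
```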
In the French-to-English example, after selecting the top three words for the first position (“in,” “Jane,” “September”), beam search would predict the next word for each candidate by considering the context of both the input sentence and the chosen first word. As the sequence progresses, it keeps only the most probable partial translations at each step, discarding the unlikely candidates.
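To tie this back to the sketch, you could run it with a beam width of 3. The vocabulary and probabilities there are invented, but with a real model the three highest-scoring first words might indeed be “in,” “Jane,” and “September,” each then extended and re-scored.

```python
rng = np.random.default_rng(0)
context = None  # placeholder for the encoder's hidden states
for score, words, _ in beam_search(context, beam_width=3, max_len=6, rng=rng):
    print(f"{score:.2f}  {' '.join(words[1:])}")
```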