Inference with Transformer Decoder

Hello,

Thank you very much for the insightful lecture on the transformer architecture! We learned about it in detail and even built a small transformer model in the programming exercise. I understand that during training we use a look-ahead mask to prevent the decoder from seeing future tokens. However, I am still not fully clear on how inference works in practice, especially in the transformer decoder.

Is the decoder usually run step by step, starting with only the start token (with padding tokens added for the rest), and then generating the next token one at a time until the end token is reached? And how does beam search decoding work with a transformer decoder?

Could you also suggest good resources where I can learn more about transformer decoder inference (greedy search and beam search), ideally ones with code?

Thank you!

In summary: yes, during inference you decode token by token. No padding is added; instead, the decoder input simply grows by one token each step until the end-of-sequence token is generated.
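That loop can be sketched as follows. This is a minimal greedy-decoding sketch, where `toy_logits` is a hypothetical stand-in for the decoder's forward pass (a real model would attend over the encoder output and the growing prefix); `BOS` and `EOS` are assumed special token ids.

```python
import numpy as np

BOS, EOS = 0, 1  # assumed start- and end-of-sequence token ids
VOCAB = 5        # toy vocabulary size

def toy_logits(prefix):
    # Hypothetical stand-in for the decoder: returns a deterministic
    # next-token score vector for the given prefix.
    rng = np.random.default_rng(sum(prefix))
    return rng.normal(size=VOCAB)

def greedy_decode(max_len=10):
    # Start from the start token only -- no padding needed.
    seq = [BOS]
    for _ in range(max_len):
        logits = toy_logits(seq)           # decoder forward pass on the prefix
        next_tok = int(np.argmax(logits))  # greedy: take the top token
        seq.append(next_tok)               # the input sequence grows by one
        if next_tok == EOS:                # stop at the end-of-sequence token
            break
    return seq
```

Note that this recomputes the whole prefix every step; real implementations cache the decoder's keys and values so each step only processes the newest token.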

There are two common search strategies: greedy search, which picks the single highest-probability token at each step, and beam search, which keeps a fixed number of highest-probability candidate sequences (the beam width) at each step.
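Beam search can be sketched like this. Again `toy_logprobs` is a hypothetical stand-in for the decoder; each hypothesis carries its cumulative log-probability, and only the `beam_width` best hypotheses survive each step.

```python
import numpy as np

BOS, EOS = 0, 1  # assumed start- and end-of-sequence token ids
VOCAB = 5        # toy vocabulary size

def toy_logprobs(prefix):
    # Hypothetical stand-in decoder: deterministic log-probabilities
    # over the vocabulary for the given prefix (log-softmax of toy logits).
    rng = np.random.default_rng(sum(prefix))
    logits = rng.normal(size=VOCAB)
    return logits - np.log(np.exp(logits).sum())

def beam_search(beam_width=3, max_len=8):
    # Each hypothesis is (token sequence, cumulative log-probability).
    beams = [([BOS], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == EOS:          # finished hypotheses carry over as-is
                candidates.append((seq, score))
                continue
            logp = toy_logprobs(seq)
            for tok in range(VOCAB):    # expand each hypothesis by every token
                candidates.append((seq + [tok], score + logp[tok]))
        # Prune: keep only the beam_width highest-scoring hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if all(seq[-1] == EOS for seq, _ in beams):
            break
    return beams[0][0]                  # best-scoring sequence
```

With `beam_width=1` this reduces to greedy search; larger widths trade compute for a broader search over candidate sequences.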