Inference with Transformer Decoder

Hello,

Thank you very much for the insightful lecture on the transformer architecture! We learned about it in detail and even built a small transformer model in the programming exercise. I understand that during training we use a look-ahead mask to prevent the decoder from seeing future tokens. However, I am still not fully clear on how inference works in practice, especially in the transformer decoder.

Is the decoder usually run step by step, starting with only the start token (with padding tokens added for the rest), and then generating the next token one at a time until the end token is reached? And how does beam search decoding work with a transformer decoder?

Could you also suggest good resources where I can learn more about transformer decoder inference (greedy search and beam search), ideally ones with code?

Thank you!

In summary: yes, during inference you decode token by token. No padding is added; instead, the decoder input simply grows by one token each step until the end-of-sequence token is generated.
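That loop can be sketched as follows. This is a minimal greedy-decoding sketch, where `toy_logits` is a hypothetical stand-in for the decoder's forward pass (a real model would attend over the encoder output and the growing prefix); `BOS` and `EOS` are assumed special token ids.

```python
import numpy as np

BOS, EOS = 0, 1  # assumed start- and end-of-sequence token ids
VOCAB = 5        # toy vocabulary size

def toy_logits(prefix):
    # Hypothetical stand-in for the decoder: returns a deterministic
    # next-token score vector for the given prefix.
    rng = np.random.default_rng(sum(prefix))
    return rng.normal(size=VOCAB)

def greedy_decode(max_len=10):
    # Start from the start token only -- no padding needed.
    seq = [BOS]
    for _ in range(max_len):
        logits = toy_logits(seq)           # decoder forward pass on the prefix
        next_tok = int(np.argmax(logits))  # greedy: take the top token
        seq.append(next_tok)               # the input sequence grows by one
        if next_tok == EOS:                # stop at the end-of-sequence token
            break
    return seq
```

Note that this recomputes the whole prefix every step; real implementations cache the decoder's keys and values so each step only processes the newest token.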

There are two common search strategies: greedy search, which picks the single highest-probability token at each step, and beam search, which keeps a fixed number of highest-probability candidate sequences (the beam width) at each step.
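Beam search can be sketched like this. Again `toy_logprobs` is a hypothetical stand-in for the decoder; each hypothesis carries its cumulative log-probability, and only the `beam_width` best hypotheses survive each step.

```python
import numpy as np

BOS, EOS = 0, 1  # assumed start- and end-of-sequence token ids
VOCAB = 5        # toy vocabulary size

def toy_logprobs(prefix):
    # Hypothetical stand-in decoder: deterministic log-probabilities
    # over the vocabulary for the given prefix (log-softmax of toy logits).
    rng = np.random.default_rng(sum(prefix))
    logits = rng.normal(size=VOCAB)
    return logits - np.log(np.exp(logits).sum())

def beam_search(beam_width=3, max_len=8):
    # Each hypothesis is (token sequence, cumulative log-probability).
    beams = [([BOS], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == EOS:          # finished hypotheses carry over as-is
                candidates.append((seq, score))
                continue
            logp = toy_logprobs(seq)
            for tok in range(VOCAB):    # expand each hypothesis by every token
                candidates.append((seq + [tok], score + logp[tok]))
        # Prune: keep only the beam_width highest-scoring hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if all(seq[-1] == EOS for seq, _ in beams):
            break
    return beams[0][0]                  # best-scoring sequence
```

With `beam_width=1` this reduces to greedy search; larger widths trade compute for a broader search over candidate sequences.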