Predicting Next Set of Tokens in Decoder Model

Hi All,
I am not able to understand the role of a masked token in a decoder-only model.

a. Does masked token prediction happen at inference time, where the model has already been trained and uses its learnt weights to predict the masked token?
b. If not (a), is it part of the pretraining process? In that case, when the word itself is not known, how are the weights of the model updated during pretraining, and how is the loss function calculated?


Let’s say we have a sentence: Jane visits Africa in September. We give this input to both the encoder and the decoder. The encoder can see all the words, does some processing, and feeds its output to the decoder. At this point, we mask all the words of the input sentence for the decoder and give it only the first word, “Jane”. It will predict the 2nd word, then the 3rd, and so on. If the decoder predicts the 2nd word (visits) wrongly, we unmask the 2nd word and give it the correct word (this is called teacher forcing). If you want to dive deep, I recommend taking NLP Course 4, which teaches the transformer.
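To make the masking idea concrete, here is a minimal numpy sketch (not from the course, just an illustration with made-up attention scores) of the causal look-ahead mask a decoder applies: each position can only attend to itself and earlier positions, so future words contribute zero attention weight.

```python
import numpy as np

# Toy sentence: when predicting the next word, the decoder must not
# "see" future words, so attention scores at future positions are
# masked out (set to -inf) before the softmax.
tokens = ["Jane", "visits", "Africa", "in", "September"]
n = len(tokens)

# Hypothetical raw attention scores (random, just for illustration).
rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))

# Causal mask: position i may attend only to positions <= i.
causal_mask = np.tril(np.ones((n, n), dtype=bool))
scores = np.where(causal_mask, scores, -np.inf)

# Row-wise softmax: masked positions get exactly zero weight.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Row 0 ("Jane") can only attend to itself.
print(weights[0])  # → [1. 0. 0. 0. 0.]
```

Note that the mask changes nothing about the architecture itself; it just zeroes out the "future" before the softmax, which is what lets the same sentence be used for training every position at once.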

Thanks Saif for your response.

It looks like the decoder and encoder work together.

Even though a decoder-only architecture is mentioned in one of the lectures.

Is the process you mentioned applicable at inference time, or is it part of model training?

Regarding the example in your response: is the first word fed as input to the decoder model a decoded word or the original word?

I am curious about the mathematics behind how the decoder model is trained using masked tokens.

I will look into the NLP course 4 you suggested.


It’s a training process.

The first word is the original word, but the 2nd, 3rd, and the rest of the words come from the decoder itself (or from the original words, in the case of the teacher forcing method).
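On the mathematics question: here is a small numpy sketch (an illustration, not the course's actual code) of how the loss is computed during pretraining. The full sentence is known, so there is no mystery about the "masked" words; the labels are simply the input tokens shifted left by one, and each position is scored with cross-entropy against the true next token.

```python
import numpy as np

# Toy vocabulary and sentence. During pretraining the whole sentence
# IS known; the model is graded on predicting each next token from
# the tokens before it.
vocab = {"Jane": 0, "visits": 1, "Africa": 2, "in": 3, "September": 4}
sentence = ["Jane", "visits", "Africa", "in", "September"]
ids = [vocab[w] for w in sentence]

inputs = ids[:-1]   # Jane visits Africa in
targets = ids[1:]   # visits Africa in September (inputs shifted by one)

# Hypothetical model output: one row of vocabulary logits per input
# position (random here, standing in for a real forward pass).
rng = np.random.default_rng(1)
logits = rng.normal(size=(len(inputs), len(vocab)))

# Cross-entropy: negative log softmax probability of the true next
# token, averaged over positions. Gradients of this loss are what
# update the weights during pretraining.
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
loss = -log_probs[np.arange(len(targets)), targets].mean()
print(float(loss))
```

Because the labels come straight from the training text, the gradient of this loss with respect to the model's weights is well defined at every step, which answers the original question about how the weights can be updated when the "masked" word is hidden from the model but not from the loss function.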

Got it.

Thank you.

Hi Saif,
Can you please confirm whether the following course is the one you are recommending?


Yes, this one.

Hello @YashiP!

Please check this reply from @Juan_Olano, where he explained how decoder-only and encoder-only models work.
