Hi All,
I am not able to understand the role of a masked token in a decoder-only model.
a. Does masked token prediction happen at inference time, where the model has already been trained and uses the learnt weights to predict the masked token?
b. If not (a), is it part of the pretraining process? In that case, when the word itself is not known, how are the model's weights updated during pretraining, and how is the loss function calculated?
Let’s say we have the sentence “Jane visits Africa in September.” We give this input to both the encoder and the decoder. The encoder can see all the words, does some processing, and feeds its output to the decoder. At this point, we mask all the words of the input sentence for the decoder and give it only the first word, “Jane”. It predicts the 2nd word, then the 3rd, and so on. If the decoder predicts the 2nd word (“visits”) wrongly, we unmask the 2nd word and give it the correct word instead (that is called teacher forcing). If you want to dive deeper, I recommend taking NLP course 4, which teaches the transformer.
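To make the masking part concrete, here is a minimal sketch (my own illustration, not code from the course) of the look-ahead mask a decoder applies in attention. It just uses NumPy and the example sentence above; row i shows which positions are visible when predicting the next word after position i.

```python
import numpy as np

# "Jane visits Africa in September" -- 5 tokens
tokens = ["Jane", "visits", "Africa", "in", "September"]
seq_len = len(tokens)

# 1 = visible, 0 = hidden from the decoder before the attention softmax
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))
print(causal_mask)
# [[1 0 0 0 0]   <- predicting word 2: only "Jane" is visible
#  [1 1 0 0 0]   <- predicting word 3: "Jane visits" is visible
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```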
The first word is the original word, but the 2nd, 3rd, and subsequent words come from the decoder itself (or are the original words, in the case of the teacher forcing method).
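As for how the loss is calculated during pretraining: the decoder outputs a prediction for every position at once, and each prediction is compared against the true next word with cross-entropy. Below is a rough sketch under my own assumptions (a tiny embedding + linear layer stands in for the full decoder stack, and the token ids, including 0 as a start token, are made up); it is only meant to show where the loss and the weight updates come from, not to be a real implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in for a decoder: embedding + output projection.
vocab_size, d_model = 10000, 32
embed = nn.Embedding(vocab_size, d_model)
to_logits = nn.Linear(d_model, vocab_size)

# "Jane visits Africa in September" as made-up token ids.
targets = torch.tensor([[17, 42, 7, 99, 3]])
decoder_input = torch.tensor([[0, 17, 42, 7, 99]])  # start token + targets shifted right

logits = to_logits(embed(decoder_input))            # (1, 5, vocab_size): one prediction per position

# Teacher forcing: the input at each step is the correct previous word,
# regardless of what the model itself would have predicted. The loss compares
# every predicted next word with the true next word.
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
loss.backward()                                     # gradients computed; an optimizer would then update the weights
```

So even though the future words are masked from the decoder's view, the training data itself contains them, which is why the loss can be computed and the weights updated.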