Hi All,
I am not able to understand the role of a masked token in a decoder-only model.
a. Does masked token prediction happen at inference time, where the model has already been trained and uses the learnt weights to predict the masked token?
b. If not (a), is it part of the pretraining process? In that case, when the word itself is not known, how are the model's weights updated during pretraining, and how is the loss function calculated?
Let’s say we have the sentence “Jane visits Africa in September.” We give this input to both the encoder and the decoder. The encoder can see all the words, does some processing, and feeds its output to the decoder. At this point, we mask all the words of the input sentence for the decoder and give it only the first word, “Jane”. It predicts the 2nd word, then the 3rd, and so on. If the decoder predicts the 2nd word (“visits”) wrongly, we unmask the 2nd word and give it the correct word instead (that is called teacher forcing). If you want to dive deeper, I recommend taking NLP course 4, which teaches the transformer.
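To make the masking part concrete, here is a minimal sketch (my own illustration, not code from the course) of the look-ahead mask a decoder applies in attention. It just uses NumPy and the example sentence above; row i shows which positions are visible when predicting the next word after position i.

```python
import numpy as np

# "Jane visits Africa in September" -- 5 tokens
tokens = ["Jane", "visits", "Africa", "in", "September"]
seq_len = len(tokens)

# 1 = visible, 0 = hidden from the decoder before the attention softmax
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))
print(causal_mask)
# [[1 0 0 0 0]   <- predicting word 2: only "Jane" is visible
#  [1 1 0 0 0]   <- predicting word 3: "Jane visits" is visible
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```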
The first word is the original word, but the 2nd, 3rd, and subsequent words come from the decoder itself (or are the original words, in the case of the teacher forcing method).
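As for how the loss is calculated during pretraining: the decoder outputs a prediction for every position at once, and each prediction is compared against the true next word with cross-entropy. Below is a rough sketch under my own assumptions (a tiny embedding + linear layer stands in for the full decoder stack, and the token ids, including 0 as a start token, are made up); it is only meant to show where the loss and the weight updates come from, not to be a real implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in for a decoder: embedding + output projection.
vocab_size, d_model = 10000, 32
embed = nn.Embedding(vocab_size, d_model)
to_logits = nn.Linear(d_model, vocab_size)

# "Jane visits Africa in September" as made-up token ids.
targets = torch.tensor([[17, 42, 7, 99, 3]])
decoder_input = torch.tensor([[0, 17, 42, 7, 99]])  # start token + targets shifted right

logits = to_logits(embed(decoder_input))            # (1, 5, vocab_size): one prediction per position

# Teacher forcing: the input at each step is the correct previous word,
# regardless of what the model itself would have predicted. The loss compares
# every predicted next word with the true next word.
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
loss.backward()                                     # gradients computed; an optimizer would then update the weights
```

So even though the future words are masked from the decoder's view, the training data itself contains them, which is why the loss can be computed and the weights updated.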