Decoder-only Transformer Training/Inference

Max_Rivera · June 6, 2023, 2:59am

How is the decoder-only architecture able to attend to all positions of the original input AND the subsequent predictions at each time step without the presence of a decoder? My understanding is this:

The original input sequence is fed into the model, and then the model’s own predictions on each time step are appended to this sequence and fed back into the model for subsequent predictions.
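A minimal sketch of the loop described above, assuming greedy decoding and a hard stop once the context window is full. The names `generate`, `toy_next_token`, and `context_size` are hypothetical, and the "model" is a made-up stand-in rather than a trained transformer:

```python
from typing import Callable, List


def generate(prompt: List[int],
             next_token: Callable[[List[int]], int],
             max_new_tokens: int,
             context_size: int) -> List[int]:
    """Autoregressive loop: append each prediction and feed the sequence back in."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        if len(tokens) >= context_size:
            break  # assumption: stop when prompt + output fills the window
        tokens.append(next_token(tokens))
    return tokens


# Hypothetical stand-in for a trained model: predicts (last token + 1) mod 10.
def toy_next_token(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10


generated = generate([3, 4, 5], toy_next_token, max_new_tokens=4, context_size=8)
print(generated)  # [3, 4, 5, 6, 7, 8, 9]
```

Real systems may handle a full window differently (for example by truncating or sliding the oldest tokens out); the hard stop here is only one simple choice.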

Can someone confirm that this is correct? Additionally, if the token size limit is 2048 (for example), this means the input + the output length combined must be less than 2048…

gent.spah · June 6, 2023, 7:35am

As far as I remember, you are mostly right. The token size limit refers to the input token size!

arvyzukai · June 6, 2023, 8:16am

Hi @Max_Rivera

The model predicts all positions, so if your output size is 8, then you get 8 prediction vectors. In your application you usually choose to pluck out the n’th token (the next token). So, as in your example, after three words as input you get predictions for all 8 outputs, but you care only about the 4th prediction.
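A small sketch of that "pluck out" step, assuming the common convention that the output at position i (0-based) is the prediction for token i + 1. The function name `pick_next_token` and all token ids are made up for illustration:

```python
from typing import List


def pick_next_token(per_position_predictions: List[int], prompt_length: int) -> int:
    """Return the prediction made at the last real input position."""
    # Position i (0-based) holds the prediction for token i + 1, so the
    # next token after a prompt of length n sits at index n - 1.
    return per_position_predictions[prompt_length - 1]


WINDOW = 8
prompt = [11, 42, 7]  # three input tokens (made-up ids)

# Made-up model output: one predicted token id per window position.
predictions = [42, 7, 99, 5, 5, 5, 5, 5]
assert len(predictions) == WINDOW

next_token = pick_next_token(predictions, len(prompt))
print(next_token)  # 99, the predicted 4th token
```

The other seven entries are still computed; at inference time they are simply ignored.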

The training (updating model weights) of this is achieved with the help of a lower triangular matrix. For example, if you had no idea which words to attend to, your prior attention could look like the Continuous Bag Of Words case:

where each token has equal influence on prediction.
But after some training your attention might look like:

Here each token aggregates more or less of the values of the previous tokens (and its own).
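To make the lower triangular idea concrete, here is a rough sketch in plain Python, assuming a softmax over raw attention scores with future positions masked out. The function name `causal_attention_weights` is hypothetical. With all scores equal (no idea what to attend to), every visible token gets the same weight:

```python
import math
from typing import List


def causal_attention_weights(scores: List[List[float]]) -> List[List[float]]:
    """Row-wise softmax over scores, with future positions masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may attend only to positions 0..i (the lower triangle).
        visible = scores[i][: i + 1]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Masked (future) positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights


n = 4
uniform = causal_attention_weights([[0.0] * n for _ in range(n)])
for row in uniform:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

After training, the scores are no longer equal, so each row still sums to 1 and stays lower triangular, but the weight is spread unevenly over the previous tokens.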

So the important thing to understand is that the decoder outputs all the predictions, and during inference you choose to care only about the next one.

Cheers

P.S. For simplicity, this example uses a time window of 8 (window size, block size, token size … other names for the same thing), but you could easily imagine it as 2048.

Max_Rivera · June 6, 2023, 12:40pm

I think we are on the same page… the token limit refers to the input token limit… and the input on each timestep is the original context + the output of the previous timestep (as shown in the diagram). Is that accurate?
