Decoder-only Transformer Training/Inference

How is the decoder-only architecture able to attend to all positions of the original input AND the subsequent predictions at each time step without the presence of an encoder? My understanding is this:

The original input sequence is fed into the model, and then the model’s own predictions on each time step are appended to this sequence and fed back into the model for subsequent predictions.

Can someone confirm that this is correct? Additionally, if the token size limit is 2048 (for example), this means the input + the output length combined must be less than 2048…

As far as I remember, you are mostly right. The token size limit refers to the input token size!

Hi @Max_Rivera

The model predicts all positions, so if your output size is 8, you get 8 prediction vectors. In your application you usually choose to pluck out the n-th one (the next token; in your example, after three words as input, you get predictions for all 8 outputs, but you care only about the 4th prediction).

The training (updating the model weights) is achieved with the help of a lower triangular matrix, which stops each position from attending to the positions after it. For example, if you had no idea which words to attend to, your prior attention could look like the Continuous Bag of Words case, where each token has equal influence on the prediction.

But after some training, your attention might look different, with each token aggregating more or less of the values of the previous tokens (and its own).
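
To make the two attention patterns concrete, here is a minimal NumPy sketch. It is not taken from any particular model: the random `scores` matrix is only a stand-in for learned query-key scores, and the helper name `causal_softmax` is made up for illustration.

```python
import numpy as np

T = 8  # window size used in this example

# Lower triangular mask: position i may look at positions 0..i only.
mask = np.tril(np.ones((T, T)))

# Untrained, bag-of-words-like attention: each row averages equally
# over the tokens it is allowed to see.
uniform_attention = mask / mask.sum(axis=1, keepdims=True)

def causal_softmax(scores, mask):
    """Row-wise softmax restricted to the allowed (lower triangular) positions."""
    masked = np.where(mask == 1, scores, -np.inf)
    masked = masked - masked.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(masked)  # exp(-inf) == 0, so future positions get no weight
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(T, T))  # assumption: stand-in for learned scores
trained_like_attention = causal_softmax(scores, mask)

print(np.round(uniform_attention, 2))
print(np.round(trained_like_attention, 2))
```

In both matrices every row sums to 1 and everything above the diagonal is 0, so row i is a weighted average over tokens 0..i only. The first matrix weights them equally; the second weights them unevenly.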

So the important thing to understand is that the decoder outputs all the predictions, and you choose to care only about the next one during inference.
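
A rough sketch of that inference loop, under some loud assumptions: `toy_model` is a hypothetical stand-in for a real decoder (it just favours "previous token + 1"), `VOCAB_SIZE` and `BLOCK_SIZE` are made-up values, and greedy argmax is used only to keep the example deterministic. Here only the tokens generated so far are passed in, so the prediction you care about is the last row.

```python
import numpy as np

VOCAB_SIZE = 10  # hypothetical toy vocabulary
BLOCK_SIZE = 8   # window size from this example; could just as well be 2048

def toy_model(tokens):
    """Hypothetical stand-in for a decoder: one logit vector per input position.

    Row i scores the token that should follow tokens[i]; this toy simply
    favours (tokens[i] + 1) % VOCAB_SIZE.
    """
    logits = np.zeros((len(tokens), VOCAB_SIZE))
    for i, tok in enumerate(tokens):
        logits[i, (tok + 1) % VOCAB_SIZE] = 1.0
    return logits

def generate(prompt, max_new_tokens):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        context = tokens[-BLOCK_SIZE:]           # prompt + outputs so far, cropped to the window
        logits = toy_model(context)              # predictions for every position
        next_token = int(np.argmax(logits[-1]))  # keep only the next-token prediction
        tokens.append(next_token)                # feed it back in on the next step
    return tokens
```

On this toy, `generate([1, 2, 3], max_new_tokens=5)` returns `[1, 2, 3, 4, 5, 6, 7, 8]`. Cropping the context to the last `BLOCK_SIZE` tokens is one possible design choice; an implementation could instead stop generating once prompt plus output reaches the limit.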


P.S. For simplicity, this example uses a time window of 8 (window size, block size, token size … other names for the same thing), but you could easily imagine it with 2048.


I think we are on the same page… the token limit refers to the input token limit… and the input on each timestep is the original context + the output of the previous timestep (as shown in diagram). Is that accurate?