Pretraining decoder-only models on sequence modelling

As mentioned in this article and the original GPT paper (here), the transformer decoder-only architecture removes the encoder blocks as well as the second multi-head attention in the decoder blocks (which was used to attend to tokens from the encoder). Therefore, the decoder block is simply a masked multi-head attention sub-layer followed by a position-wise feed-forward layer.
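To make the structure concrete, here is a minimal single-head NumPy sketch of such a decoder block (the weight names, single head, and layer sizes are illustrative simplifications, not taken from the paper):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each position over the feature dimension
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def masked_self_attention(x, Wq, Wk, Wv):
    # single head for brevity; the real block uses several heads
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # causal mask: position i may only attend to positions <= i
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -1e9
    return softmax(scores) @ v

def decoder_block(x, p):
    # masked self-attention, then a position-wise feed-forward net,
    # each wrapped in a residual connection ("add") and layer norm
    a = masked_self_attention(x, p["Wq"], p["Wk"], p["Wv"])
    x = layer_norm(x + a)                                # add & norm 1
    h = np.maximum(0.0, x @ p["W1"]) @ p["W2"]           # ReLU feed-forward
    return layer_norm(x + h)                             # add & norm 2
```

Because of the causal mask, the output at position `i` depends only on positions up to `i`, which is what lets the model be trained with the next-token (sequence-modelling) objective.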

My question has two parts;

  1. Are the add and norm layers kept from the original transformer, or are they removed? This is not directly mentioned in the paper, but is it implied?
  2. When this model is pretrained on sequence modelling and then finetuned on a supervised seq2seq task, the encoder would have to be reattached. Does this mean an untrained multi-head attention block is inserted into the already pretrained decoder after the first masked attention and feed-forward layer (since the model would now have to attend to tokens from the encoder)?

Hi @gursi26

It is mentioned - see the architecture diagram in the paper:
Note the “+” and “Layer Norm” blocks, i.e. the add and norm layers are kept.
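The “+” and “Layer Norm” pattern can be written in two ways; GPT-1 follows the original transformer's post-norm placement, while GPT-2 later moved the layer norm to the input of each sub-block (pre-norm). A small sketch of both variants (the sublayer here is any attention or feed-forward function):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def post_norm(x, sublayer):
    # original transformer / GPT-1: LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

def pre_norm(x, sublayer):
    # GPT-2 variant: x + Sublayer(LayerNorm(x))
    return x + sublayer(layer_norm(x))
```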

I’m not sure what you mean - the encoder is never “reattached”. If you had to attach an encoder, the whole model would effectively need to be retrained from scratch: just inserting a “Cross Attention” block would not fit the pretrained weights by design.
Usually, when you finetune a decoder-only model, you finetune one or a handful of layers (existing or newly added). Adding multiple encoder blocks would not “fit” by design, and even if it did, finetuning would then not be very different from training from scratch.
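The “finetune one or a handful of layers” idea just means updating a selected subset of parameters while the rest stay frozen at their pretrained values. A minimal sketch with plain NumPy (the parameter names and the SGD update are illustrative; in a real framework you would freeze layers via something like `requires_grad`):

```python
import numpy as np

def sgd_step(params, grads, trainable, lr=1e-2):
    # update only the parameters selected for finetuning;
    # everything else stays frozen at its pretrained value
    return {
        name: (w - lr * grads[name]) if name in trainable else w
        for name, w in params.items()
    }

# hypothetical pretrained decoder layers plus a newly added task head
params = {
    "layer1": np.ones(4),
    "layer2": np.ones(4),
    "head":   np.zeros(4),   # added for the downstream task
}
grads = {k: np.ones_like(v) for k, v in params.items()}

# finetune only the last layer and the new head
params = sgd_step(params, grads, trainable={"layer2", "head"}, lr=0.1)
```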