Pretraining decoder-only models on sequence modelling

As mentioned in this article and the original GPT paper (here), the transformer decoder-only architecture removes the encoder blocks as well as the second multi-head attention in the decoder blocks (which was used to attend to tokens from the encoder). Therefore, the decoder block is simply a masked multi-head attention sub-layer followed by a position-wise feed-forward layer.
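To make the structure concrete, here is a minimal single-head NumPy sketch of such a decoder block (the weight names, single head, and layer sizes are illustrative simplifications, not taken from the paper):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each position over the feature dimension
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def masked_self_attention(x, Wq, Wk, Wv):
    # single head for brevity; the real block uses several heads
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # causal mask: position i may only attend to positions <= i
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -1e9
    return softmax(scores) @ v

def decoder_block(x, p):
    # masked self-attention, then a position-wise feed-forward net,
    # each wrapped in a residual connection ("add") and layer norm
    a = masked_self_attention(x, p["Wq"], p["Wk"], p["Wv"])
    x = layer_norm(x + a)                                # add & norm 1
    h = np.maximum(0.0, x @ p["W1"]) @ p["W2"]           # ReLU feed-forward
    return layer_norm(x + h)                             # add & norm 2
```

Because of the causal mask, the output at position `i` depends only on positions up to `i`, which is what lets the model be trained with the next-token (sequence-modelling) objective.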

My question has two parts;

  1. Are the add and norm layers kept from the original transformer, or are they removed? This is not directly mentioned in the paper, but is it implied?
  2. When this model is pretrained on sequence modelling and then finetuned on a supervised seq2seq task, the encoder would have to be reattached. Does this mean an untrained multi-head attention block is inserted into the already pretrained decoder after the first masked attention and feed-forward layer (since the model would now have to attend to tokens from the encoder)?

Hi @gursi26

It is mentioned - see the architecture diagram in the paper:
Note the “+” and “Layer Norm” blocks, i.e. the add and norm layers are kept.
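The “+” and “Layer Norm” pattern can be written in two ways; GPT-1 follows the original transformer's post-norm placement, while GPT-2 later moved the layer norm to the input of each sub-block (pre-norm). A small sketch of both variants (the sublayer here is any attention or feed-forward function):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def post_norm(x, sublayer):
    # original transformer / GPT-1: LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

def pre_norm(x, sublayer):
    # GPT-2 variant: x + Sublayer(LayerNorm(x))
    return x + sublayer(layer_norm(x))
```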

I’m not sure what you mean - the encoder is never “reattached”. If you had to attach an encoder, the whole model would effectively need to be retrained from scratch: just inserting a “Cross Attention” block would not fit the pretrained weights by design.
Usually, when you finetune a decoder-only model, you finetune one or a handful of layers (existing or newly added). Adding multiple encoder blocks would not “fit” by design, and even if it did, finetuning would then not be very different from training from scratch.
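The “finetune one or a handful of layers” idea just means updating a selected subset of parameters while the rest stay frozen at their pretrained values. A minimal sketch with plain NumPy (the parameter names and the SGD update are illustrative; in a real framework you would freeze layers via something like `requires_grad`):

```python
import numpy as np

def sgd_step(params, grads, trainable, lr=1e-2):
    # update only the parameters selected for finetuning;
    # everything else stays frozen at its pretrained value
    return {
        name: (w - lr * grads[name]) if name in trainable else w
        for name, w in params.items()
    }

# hypothetical pretrained decoder layers plus a newly added task head
params = {
    "layer1": np.ones(4),
    "layer2": np.ones(4),
    "head":   np.zeros(4),   # added for the downstream task
}
grads = {k: np.ones_like(v) for k, v in params.items()}

# finetune only the last layer and the new head
params = sgd_step(params, grads, trainable={"layer2", "head"}, lr=0.1)
```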