In the decoder block of the transformer model, a feed-forward linear layer is used after the multi-head attention layer.
Since the length of the output of the multi-head attention layer changes as the input length changes, how is this fed into a fixed-size feed-forward layer, which only accepts inputs of a fixed size?
Hi @gursi26 ,
If you look at the details of the model, there is padding and a mask. The padding fills the input out to the required fixed size, and the mask tells the model to only look at the unmasked part, which is the part actually being fed into the decoder. If the decoder's input length is, say, 512 tokens and we are just starting, then we would probably see 2 positions with real data and 510 positions filled with padding and masked out.
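Here is a minimal sketch of that idea, assuming a fixed decoder length of 512 and a hypothetical padding token id of 0 (the exact ids and tensor layout depend on the tokenizer and model you use):

```python
import torch

MAX_LEN = 512   # assumed fixed decoder input length from the example above
PAD_ID = 0      # hypothetical padding token id

# Only 2 "real" tokens so far, as in the example
tokens = torch.tensor([101, 7592])
num_pad = MAX_LEN - tokens.size(0)   # 510 positions left to fill

# Pad the sequence up to the fixed length
padded = torch.cat([tokens, torch.full((num_pad,), PAD_ID)])

# Mask: 1 = real token to attend to, 0 = padding to ignore
attention_mask = torch.cat([torch.ones(tokens.size(0)), torch.zeros(num_pad)])

print(padded.shape)          # torch.Size([512])
print(attention_mask.shape)  # torch.Size([512])
```

So the layers always see a tensor of the same fixed length, and the mask keeps the padded positions from influencing the result.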
Hope this helps!