In the decoder block of the transformer model, a feed-forward linear layer is used after the multi-head attention layer.
Since the length of the output of the multi-head attention layer changes as the input length changes, how is this fed into a fixed-size feed-forward layer, which only accepts inputs of a fixed size?
Hi @gursi26 ,
If you look at the details of the model, there is padding and a mask. The padding fills the input out to the required fixed size, and the mask tells the model to only look at the unmasked part, which is the part actually being fed into the decoder. If the decoder's input length is, say, 512 tokens and we are just starting, then we would probably see 2 positions with real data and 510 positions filled with padding and masked out.
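Here is a minimal sketch of that idea, assuming a fixed decoder length of 512 and a hypothetical padding token id of 0 (the exact ids and tensor layout depend on the tokenizer and model you use):

```python
import torch

MAX_LEN = 512   # assumed fixed decoder input length from the example above
PAD_ID = 0      # hypothetical padding token id

# Only 2 "real" tokens so far, as in the example
tokens = torch.tensor([101, 7592])
num_pad = MAX_LEN - tokens.size(0)   # 510 positions left to fill

# Pad the sequence up to the fixed length
padded = torch.cat([tokens, torch.full((num_pad,), PAD_ID)])

# Mask: 1 = real token to attend to, 0 = padding to ignore
attention_mask = torch.cat([torch.ones(tokens.size(0)), torch.zeros(num_pad)])

print(padded.shape)          # torch.Size([512])
print(attention_mask.shape)  # torch.Size([512])
```

So the layers always see a tensor of the same fixed length, and the mask keeps the padded positions from influencing the result.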
Hope this helps!