I don't understand the transformer's decoder

First of all, the encoder as I understand it is like this.

The sentence “I am a boy” goes into the input of the encoder: tokenization → embedding vectors → multi-head attention (the vector values are updated with context) → feed-forward network → out come tokens with well-established context.

That is the encoder as I understand it; roughly, I picture one encoder layer like the sketch below.
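
To check my understanding, here is a minimal PyTorch-style sketch of one encoder layer (the layer sizes and names are just illustrative, not taken from the course):

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention -> add & norm -> feed-forward -> add & norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (batch, seq_len, d_model) token embeddings
        attn_out, _ = self.attn(x, x, x)     # every token attends to every other token
        x = self.norm1(x + attn_out)         # residual connection + layer norm
        x = self.norm2(x + self.ff(x))       # position-wise feed-forward + residual + norm
        return x                             # contextualized vectors, one per input token

# "I am a boy" -> 4 tokens -> 4 embedding vectors of size d_model
tokens = torch.randn(1, 4, 512)              # stand-in for embedded + position-encoded input
context = EncoderLayer()(tokens)             # shape stays (1, 4, 512), values now carry context
```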

And now the decoder part is the problem.

The decoder initially receives an SOS token as input, and I don’t quite understand the masked multi-head attention here.

At the beginning of decoding, there is only a single SOS token in the masked multi-head attention layer. How does the masking work, and how is the next value predicted?

How does masked multi-head attention work with just one SOS token?

I think it’s a good idea to check the NLP Specialization, especially Course 4, which explains the transformer architecture.

Hi @Goomin

The decoder is also composed of a stack of identical layers.
In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.
We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions.
This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
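
To make the masking concrete, here is a minimal sketch in PyTorch. The `decoder` function, `SOS_ID`, and `EOS_ID` are hypothetical placeholders just for illustration; they are not from the assignment code.

```python
import torch

def causal_mask(seq_len):
    """Lower-triangular mask: position i may only attend to positions <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask(1))   # step 1: only <SOS> is present -> a 1x1 mask [[True]]
                        # <SOS> simply attends to itself; there is nothing to hide yet.
print(causal_mask(3))   # step 3: <SOS>, y1, y2 -> each row sees only earlier positions
# tensor([[ True, False, False],
#         [ True,  True, False],
#         [ True,  True,  True]])

# Greedy decoding loop (inference): the decoder re-runs on the growing sequence.
# `decoder` is a hypothetical function taking (generated_ids, encoder_output, mask)
# and returning next-token logits; SOS_ID / EOS_ID are placeholder vocabulary ids.
def generate(decoder, encoder_output, SOS_ID=1, EOS_ID=2, max_len=20):
    generated = [SOS_ID]                              # step 1: sequence is just <SOS>
    for _ in range(max_len):
        mask = causal_mask(len(generated))            # grows as tokens are added
        logits = decoder(torch.tensor([generated]), encoder_output, mask)
        next_id = int(logits[0, -1].argmax())         # prediction for the next position
        generated.append(next_id)
        if next_id == EOS_ID:
            break
    return generated
```

During training, the whole (shifted) target sequence is fed in at once, and the mask stops position i from seeing later positions. At inference time, on the very first step only <SOS> exists, so the mask is trivially 1×1: <SOS> attends to itself, and the decoder’s output at that position is used to predict the first real token. Each predicted token is then appended, and the decoder runs again on the longer sequence.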

I am sharing the Transformer PDF. Kindly go through it; it explains the same part I explained to you here. In case you still have doubts, feel free to ask!

Transformer.pdf (2.1 MB)

Regards
DP