I can't quite understand the transformer structure

Hi @Kerem_Boyuk

I would suggest going through a very similar recent thread and, in particular, trying to understand this picture (of scaled dot-product multi-head attention):

In my head, it is somewhat similar to LSTM-based attention:

The main difference that I see is that in the LSTM-based encoder the hidden states are the output of the LSTM network, while the "hidden states" of the transformer's encoder come from dot products between each token's transformed embeddings.
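In code, the idea looks roughly like this. This is only a minimal single-head sketch (no masking, no multi-head split, no output projection); the projection matrices `W_q`, `W_k`, `W_v` and the toy shapes are placeholders I made up for illustration:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(x, W_q, W_k, W_v):
    # x: token embeddings of shape (batch, seq_len, d_model)
    q = x @ W_q  # queries: (batch, seq_len, d_k)
    k = x @ W_k  # keys:    (batch, seq_len, d_k)
    v = x @ W_v  # values:  (batch, seq_len, d_v)
    d_k = q.size(-1)
    # Dot product of every query with every key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)            # attention weights
    # Each output "hidden state" is a weighted sum of the value vectors
    return weights @ v                             # (batch, seq_len, d_v)

# Toy usage: 2 sequences of 5 tokens, embedding size 8, head size 4
x = torch.randn(2, 5, 8)
W_q, W_k, W_v = (torch.randn(8, 4) for _ in range(3))
out = scaled_dot_product_attention(x, W_q, W_k, W_v)
print(out.shape)  # torch.Size([2, 5, 4])
```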

Anyway, if you have a hard time understanding the first picture, feel free to ask questions about specific sections.

Cheers
