I can't quite understand the transformer structure

Hey @arvyzukai
Thank you for your answer.

Maybe I didn’t explain the point I don’t fully understand clearly enough.
For a long time I have tried to understand the transformer structure by browsing blogs, and of course your answers helped a lot.
From the very beginning, I have tried to understand transformers by analogy with the LSTM-based encoder-decoder structure.
The third answer in this post (neural networks - What exactly are keys, queries, and values in attention mechanisms? - Cross Validated) was very helpful for developing my intuition. The main takeaways from your answers and from that post are as follows:

  1. I can compare the Transformer encoder directly to the LSTM encoder.
  • 1.1 - An LSTM processes the sequence step by step, so token order is captured implicitly; the transformer processes all tokens at once. To close this gap, it adds a positional encoding to each embedding, so the order information is preserved (see the sketch after point 1.2). This video explains the whole thing: https://www.youtube.com/watch?v=dichIcUZfOw

  • 1.2 - The answer I linked above nicely explains what Q, K, and V correspond to in the LSTM-based model. While thinking about that answer, I realized that self-attention resembles a plain LSTM pass, where Q comes not from the decoder but from the same sequence being encoded. In other words, when the transformer derives Q from its own input, the mechanism works like an LSTM encoder and produces hidden states. So you end up with a semantic representation of the input (a sketch follows right below).
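To make 1.1 and 1.2 concrete, here is a minimal NumPy sketch of both ideas: sinusoidal positional encoding added to the embeddings, then scaled dot-product self-attention where Q, K, and V all come from the same input. The sizes and random weights are made up for illustration; a real model learns these projections.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding: gives each position a distinct, order-aware pattern
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model)[None, :]            # (1, d_model)
    angles = pos / 10000 ** (2 * (i // 2) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def self_attention(x, Wq, Wk, Wv):
    # Q, K, V all come from the same input x -- the "encoder LSTM" analogy of 1.2
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # compatibility of every query with every key
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                           # one context ("hidden state") per token

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                          # made-up toy sizes
x = rng.normal(size=(seq_len, d_model))          # stand-in for token embeddings
x = x + positional_encoding(seq_len, d_model)    # inject order information (point 1.1)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
h = self_attention(x, Wq, Wk, Wv)                # point 1.2: semantic encoding of the input
print(h.shape)                                   # (5, 8): all 5 positions computed in parallel
```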

  2. This is how the first Multi-Head Attention of both the Encoder and the Decoder layers works in the Transformer architecture. In other words, it produces hidden states just like the plain LSTM layers in the LSTM-based encoder-decoder. The differences are the positional encoding and the parallelized computation (a multi-head sketch follows below).
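On top of processing all positions at once, “multi-head” means several attention operations also run side by side, each on its own slice of the feature dimension, with the results concatenated at the end. A minimal NumPy sketch, again with made-up sizes and random weights:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, Wq, Wk, Wv, Wo, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    # one big projection, then split the feature axis into independent heads
    def heads(W):
        return (x @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = heads(Wq), heads(Wk), heads(Wv)            # (n_heads, seq_len, d_head)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # every head attends at once
    ctx = softmax(scores) @ V                            # (n_heads, seq_len, d_head)
    ctx = ctx.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate the heads
    return ctx @ Wo                                      # final output projection

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 8))
Wq, Wk, Wv, Wo = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_self_attention(x, Wq, Wk, Wv, Wo, n_heads=2).shape)  # (5, 8)
```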

  3. The second Multi-Head Attention of the decoder works like the cross-attention operation, that is, the attention layer between the LSTM-based encoder and decoder. Q comes from the decoder hidden state, while K and V come from the encoder hidden states, so it computes cross-attention and outputs the result. I think we can understand this by looking at the same figure (see the sketch after the quote):

“Here, the query is from the decoder hidden state, the key and value are from the encoder hidden states (key and value are the same in this figure). The score is the compatibility between the query and key, which can be a dot product between the query and key (or other form of compatibility). The scores then go through the softmax function to yield a set of weights whose sum equals 1. Each weight multiplies its corresponding values to yield the context vector which utilizes all the input hidden states.”
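The quoted steps (dot-product score, softmax weights summing to 1, weighted sum of the values) translate almost line by line into NumPy. A minimal sketch of the simplest case, where key and value are the same encoder hidden states as in that figure, with made-up shapes:

```python
import numpy as np

def cross_attention_step(decoder_state, encoder_states):
    # score: dot-product compatibility between the query and each key
    scores = encoder_states @ decoder_state
    # softmax: a set of weights whose sum equals 1
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    # context vector: weighted sum that utilizes ALL input hidden states
    context = weights @ encoder_states
    return context, weights

rng = np.random.default_rng(2)
encoder_states = rng.normal(size=(6, 8))  # K = V: 6 source tokens, hidden size 8
decoder_state = rng.normal(size=8)        # Q: the current decoder hidden state
context, weights = cross_attention_step(decoder_state, encoder_states)
print(weights.round(2), weights.sum())    # six weights summing to 1.0
```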

One of the most important things I understood is that, just as the LSTM-based decoder produces one word per attention cycle, the transformer decoder builds the whole sentence by appending one word to its output at each step according to the output probabilities (see the decoding sketch below). In this sense the transformer emulates the LSTM-based encoder-decoder system.


Source: KiKaBeN - Transformer’s Encoder-Decoder
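Here is a minimal sketch of that decoding cycle: the output so far is fed back in, the most probable next word is appended, and the loop stops at the end-of-sequence token. `fake_step`, `bos_id`, and `eos_id` are hypothetical stand-ins for a trained decoder and its token ids:

```python
import numpy as np

def greedy_decode(step_fn, bos_id, eos_id, max_len=20):
    tokens = [bos_id]
    for _ in range(max_len):
        probs = step_fn(tokens)        # P(next word | words so far, source sentence)
        next_id = int(np.argmax(probs))
        tokens.append(next_id)         # add the chosen word to the output...
        if next_id == eos_id:          # ...and stop once the end token appears
            break
    return tokens

# hypothetical stand-in for a trained transformer decoder step
rng = np.random.default_rng(3)
def fake_step(tokens):
    p = rng.random(10)                 # made-up vocabulary of 10 token ids
    return p / p.sum()

print(greedy_decode(fake_step, bos_id=0, eos_id=9))
```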

To sum up: if Q comes from the same sequence as K and V, the layer produces hidden states just like an LSTM layer, and these hidden states semantically encode the input for the model. If Q comes from the decoder while K and V come from the encoder, the attention layer produces a single word, just like in the LSTM-based model. These words are appended and re-evaluated every cycle to determine the probability of the next word, until the end-of-sequence token is produced.

If you still think my intuition is wrong, feel free to correct me.

I hope this will be helpful for others to understand as well. :smile: