I can't quite understand the transformer structure

Hey @arvyzukai
Thank you for your answer.

Maybe I didn’t explain the point I don’t fully understand clearly enough.
For a long time I have tried to understand the transformer structure by browsing blogs, and of course your answers helped a lot.
From the very beginning, I have tried to understand transformers by analogy with the LSTM-based encoder-decoder structure.
The third answer in this post (neural networks - What exactly are keys, queries, and values in attention mechanisms? - Cross Validated) was very helpful for developing my intuition. The main takeaways from your answers and from that post are as follows:

  1. I can compare the Transformer encoder directly to the LSTM encoder.
  • 1.1 - An LSTM processes the sequence step by step, so token order is captured implicitly; the transformer processes all tokens at once. To close this gap, it adds a positional encoding to each embedding, so the order information is preserved (see the sketch after point 1.2). This video explains the whole thing: https://www.youtube.com/watch?v=dichIcUZfOw

  • 1.2 - The answer I linked above nicely explains what Q, K, and V correspond to in the LSTM-based model. While thinking about that answer, I realized that self-attention resembles a plain LSTM pass, where Q comes not from the decoder but from the same sequence being encoded. In other words, when the transformer derives Q from its own input, the mechanism works like an LSTM encoder and produces hidden states. So you end up with a semantic representation of the input (a sketch follows right below).
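To make 1.1 and 1.2 concrete, here is a minimal NumPy sketch of both ideas: sinusoidal positional encoding added to the embeddings, then scaled dot-product self-attention where Q, K, and V all come from the same input. The sizes and random weights are made up for illustration; a real model learns these projections.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding: gives each position a distinct, order-aware pattern
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model)[None, :]            # (1, d_model)
    angles = pos / 10000 ** (2 * (i // 2) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def self_attention(x, Wq, Wk, Wv):
    # Q, K, V all come from the same input x -- the "encoder LSTM" analogy of 1.2
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # compatibility of every query with every key
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                           # one context ("hidden state") per token

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                          # made-up toy sizes
x = rng.normal(size=(seq_len, d_model))          # stand-in for token embeddings
x = x + positional_encoding(seq_len, d_model)    # inject order information (point 1.1)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
h = self_attention(x, Wq, Wk, Wv)                # point 1.2: semantic encoding of the input
print(h.shape)                                   # (5, 8): all 5 positions computed in parallel
```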

  2. This is how the first Multi-Head Attention of both the Encoder and the Decoder layers works in the Transformer architecture. In other words, it produces hidden states just like the plain LSTM layers in the LSTM-based encoder-decoder. The differences are the positional encoding and the parallelized computation (a multi-head sketch follows below).
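On top of processing all positions at once, “multi-head” means several attention operations also run side by side, each on its own slice of the feature dimension, with the results concatenated at the end. A minimal NumPy sketch, again with made-up sizes and random weights:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, Wq, Wk, Wv, Wo, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    # one big projection, then split the feature axis into independent heads
    def heads(W):
        return (x @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = heads(Wq), heads(Wk), heads(Wv)            # (n_heads, seq_len, d_head)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # every head attends at once
    ctx = softmax(scores) @ V                            # (n_heads, seq_len, d_head)
    ctx = ctx.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate the heads
    return ctx @ Wo                                      # final output projection

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 8))
Wq, Wk, Wv, Wo = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_self_attention(x, Wq, Wk, Wv, Wo, n_heads=2).shape)  # (5, 8)
```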

  3. The second Multi-Head Attention of the decoder works like the cross-attention operation, that is, the attention layer between the LSTM-based encoder and decoder. Q comes from the decoder hidden state, while K and V come from the encoder hidden states, so it computes cross-attention and outputs the result. I think we can understand this by looking at the same figure (see the sketch after the quote):

“Here, the query is from the decoder hidden state, the key and value are from the encoder hidden states (key and value are the same in this figure). The score is the compatibility between the query and key, which can be a dot product between the query and key (or other form of compatibility). The scores then go through the softmax function to yield a set of weights whose sum equals 1. Each weight multiplies its corresponding values to yield the context vector which utilizes all the input hidden states.”
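The quoted steps (dot-product score, softmax weights summing to 1, weighted sum of the values) translate almost line by line into NumPy. A minimal sketch of the simplest case, where key and value are the same encoder hidden states as in that figure, with made-up shapes:

```python
import numpy as np

def cross_attention_step(decoder_state, encoder_states):
    # score: dot-product compatibility between the query and each key
    scores = encoder_states @ decoder_state
    # softmax: a set of weights whose sum equals 1
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    # context vector: weighted sum that utilizes ALL input hidden states
    context = weights @ encoder_states
    return context, weights

rng = np.random.default_rng(2)
encoder_states = rng.normal(size=(6, 8))  # K = V: 6 source tokens, hidden size 8
decoder_state = rng.normal(size=8)        # Q: the current decoder hidden state
context, weights = cross_attention_step(decoder_state, encoder_states)
print(weights.round(2), weights.sum())    # six weights summing to 1.0
```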

One of the most important things I understood is that, just as the LSTM-based decoder produces one word per attention cycle, the transformer decoder builds the whole sentence by appending one word to its output at each step according to the output probabilities (see the decoding sketch below). In this sense the transformer emulates the LSTM-based encoder-decoder system.


Source: KiKaBeN - Transformer’s Encoder-Decoder
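Here is a minimal sketch of that decoding cycle: the output so far is fed back in, the most probable next word is appended, and the loop stops at the end-of-sequence token. `fake_step`, `bos_id`, and `eos_id` are hypothetical stand-ins for a trained decoder and its token ids:

```python
import numpy as np

def greedy_decode(step_fn, bos_id, eos_id, max_len=20):
    tokens = [bos_id]
    for _ in range(max_len):
        probs = step_fn(tokens)        # P(next word | words so far, source sentence)
        next_id = int(np.argmax(probs))
        tokens.append(next_id)         # add the chosen word to the output...
        if next_id == eos_id:          # ...and stop once the end token appears
            break
    return tokens

# hypothetical stand-in for a trained transformer decoder step
rng = np.random.default_rng(3)
def fake_step(tokens):
    p = rng.random(10)                 # made-up vocabulary of 10 token ids
    return p / p.sum()

print(greedy_decode(fake_step, bos_id=0, eos_id=9))
```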

To sum up: if Q comes from the same sequence as K and V, the layer produces hidden states just like an LSTM layer, and these hidden states semantically encode the input for the model. If Q comes from the decoder while K and V come from the encoder, the attention layer produces a single word, just like in the LSTM-based model. These words are appended and re-evaluated every cycle to determine the probability of the next word, until the end-of-sequence token is produced.

If you still think my intuition is wrong, feel free to correct me.

I hope this will be helpful for others to understand as well. :smile: