I can't quite understand the transformer structure

I don’t fully understand transformers. For a while I went through sites that try to explain them, and I know the statement ‘Transformers were created to reduce the processing cost of LSTMs and LSTM-based attention mechanisms through parallelization.’ But I’m having trouble understanding the transformer structure:
1- Positional embedding uses sine and cosine operations to add information about word order to the data.
2- Head attention acts like an LSTM that contains an attention mechanism. The V (value) here is extracted by sending the sequential information through a fully connected layer. In other words, the LSTM’s processing of sequential data is replaced in transformers by the positional embedding and the V values.
3- The query and key values give an attention score, that is, how much an output is affected by other words. The resulting word is obtained by multiplying the value (the data encoding) by the attention score.
So I’m saying that a sentence is produced from the input sentence at the end of the encoder’s first attention block, but I don’t know what that sentence is.
4- Then this sentence extracts the query and key with the feed-forward layer, that is, it transfers the attention values within its own sentence to the decoder. The decoder takes the output sentence and produces a sentence; this sentence uses its own value, while the query and key come from the encoder.
The problem is, I don’t quite understand why you would do such a thing here. Is the purpose to compare the semantic matrix of the output sentence in the decoder with the encoder’s result for similarity?
Afterwards, why is multi-head attention used again in the decoder, and what is the structure that produces this result? Thanks in advance for the answers.

Hi @Kerem_Boyuk

Welcome to the community.

Nice overview!

I will just break down some transformer structures and concepts, OK?

  1. Positional Embeddings: You’re correct that positional embeddings are used to provide information about the order of words in the input sequence. This is crucial because transformers don’t inherently have a built-in sense of sequential order like recurrent models. The use of sine and cosine functions helps encode the position information into the embeddings (see the short sketch after this list).

  2. Self-Attention Mechanism (Not “Head Attention”):

    • The self-attention mechanism allows each word in the input sequence to focus on other words to capture contextual relationships.
    • There are three kinds of transformations applied to each input word: Queries (Q), Keys (K), and Values (V).
    • The dot product of the Query of a word with the Key of another word produces an attention score, indicating the relevance of that word to the first word.
    • This attention score is used to weight the Values of the other words, and the weighted sum is the contextual representation of the first word based on the information from other words.
  3. Multi-Head Attention in Encoder:

    • The self-attention mechanism is performed multiple times in parallel, each time using different learned linear projections (Queries, Keys, and Values). These are the “heads” in multi-head attention.
    • The outputs of these parallel self-attention heads are concatenated and linearly transformed again to produce the final output.
    • This process allows the model to focus on different aspects of the input and learn different relationships.
  4. Encoder-Decoder Architecture:

    • The transformer architecture consists of an encoder and a decoder for sequence-to-sequence tasks like translation.
    • The encoder processes the input sequence and produces a representation of it, which is then used by the decoder.
    • The decoder also employs a self-attention mechanism to capture dependencies among output words.
    • Additionally, the decoder uses an attention mechanism over the encoder’s output to incorporate information from the input sequence.
  5. Multi-Head Attention in Decoder:

    • The decoder’s multi-head attention has two parts: one self-attention mechanism over the decoder’s own output (similar to the encoder), and another attention mechanism over the encoder’s output.
    • The combination of these two attention mechanisms enables the decoder to focus on relevant parts of the input and its own generated output, helping generate accurate translations or predictions.
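
If it helps to see the sine/cosine idea from point 1 in code, here is a minimal numpy sketch of the sinusoidal positional encoding from the original paper (the function name and the toy sizes are just illustrative):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding, as in 'Attention Is All You Need'.

    Returns a (seq_len, d_model) matrix that is added to the token embeddings
    so that each position gets a unique, smoothly varying signature.
    """
    positions = np.arange(seq_len)[:, np.newaxis]             # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                  # (1, d_model)
    # Each pair of dimensions (2i, 2i+1) uses a different frequency.
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                          # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                     # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                     # odd dimensions: cosine
    return pe

# The encoding is simply added to the embeddings before the first attention block.
embeddings = np.random.randn(10, 16)    # 10 tokens, embedding size 16 (toy values)
inputs = embeddings + positional_encoding(10, 16)
```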

The goal of the transformer is not to directly compare the semantic matrices of the encoder and decoder. Instead, it aims to capture rich contextual information from the input sequence and use it effectively for generating accurate and contextually relevant output sequences.

In summary, the transformer’s power lies in its self-attention mechanism, parallel processing with multi-head attention, and the encoder-decoder architecture, which collectively enable it to model complex relationships in sequential data more effectively than traditional recurrent models like LSTMs.

I hope this helps

Best regards
elirod

hey @elirod thank you for your answer. I guess I had a hard time understanding the roles of Q, K, and V in general here. At first I learned the LSTM-based attention mechanism and had no difficulty: very simply, a semantic representation is output by the first LSTM, and a score specifies how much each of those outputs affects the first word of the other LSTM. Maybe it’s my fault that I tried to see this system in self-attention as well. I likened the V value and positional embedding to the first LSTM layer of the attention mechanism I mentioned earlier, which produces the results, and I compared the use of Q and K to creating the attention score in that mechanism. In the first mechanism a sentence was being created, and because I compared these two mechanisms, I thought a sentence should come out of the self-attention mechanism and that its purpose was to compare the results from the decoder and the encoder. In short, the part I don’t understand is self-attention in general. According to your answer, should I assume that self-attention produces a semantic representation? The encoder produces a semantic equivalent of the input and the decoder produces a semantic equivalent of the output. Since I don’t fully understand the self-attention mechanism, I don’t understand why the Q and K values from the encoder are processed with the output of the decoder. In addition, the softmax layer at the end made me think that these two sentences were being compared; I would like to ask the purpose of that layer. Thank you again.

Hi @Kerem_Boyuk

I would suggest you go through a very similar recent thread and in particular try to understand this picture (of Scaled Multi-Head Dot-Product Attention):

In my head, it is somewhat similar to the lstm-based attention:

The main difference that I see is that in the LSTM-based encoder the hidden states are the result of the LSTM network, while the “hidden states” of the transformer’s encoder are the dot products of each token’s transformed embeddings.

Anyway, if you have a hard time understanding the first picture, feel free to ask questions about specific sections.

Cheers

Hey @arvyzukai

Thank you for your answer.
I looked at the chart you posted. I know the operations, but I’m not sure about their meaning and what they accomplish. Also, I’m a bit confused because there are two answers in the link you posted.
Please correct my mistakes:
1- The transformer works like an LSTM encoder: the LSTM processes the data sequentially and generates a semantic equivalent for each word, and then this data is sent to the decoder.
2- The transformer parallelizes the LSTM’s tasks; for example, positional encoding provides the sequence information in parallel. Q, K, and V represent how each piece of data is affected by the others, but unlike the LSTM this is still a parallel operation.
3- The K value of each word combines with the other words’ queries like a key and extracts the meaning between those words. In short, it is similar to the semantic-correspondence extraction in the LSTM. Then it is multiplied by V. I do not know exactly what the V value is; I would appreciate it if you could explain.
4- At the end, a meaning comes out for each word, as in the LSTM.

In addition, I would like to point out that I could not find many sources about the decoder, so I do not know the attention mechanism of the decoder, nor the relationship between the decoder and the encoder. For example, it is very strange to me that in the encoder-decoder relationship the K and Q values are taken from the encoder and the V value from the decoder. Therefore, I cannot understand the Q, K, and V values.

You are not alone. Very few people understand how a Transformer works. The documentation and explanations are extremely confusing. I am also one who doesn’t understand it.

Hey @TMosh
I totally agree. I’ve been to many blogs, and they say that the docs don’t explain exactly what some things do. But perhaps this is an opportunity for us to understand. We should keep going for it. :smiley:

Hi @Kerem_Boyuk

The underlying processes are quite different but the goal is the same - as you stated, to generate a representation of the sequence (each token is represented with its own piece) for a decoder (if there is one).

If what I think you mean is what you mean :slight_smile: , then that statement is correct.

Just to elaborate a bit - there are different “flavors” of transformers with different underlying parts. And directly comparing the underlying processes of LSTMs with transformers is not fair; they are quite different.
Positional encoding is necessary for transformers because the “attention” mechanism does not “care” about the position of the tokens and all attention values are calculated in “one go”. For example, if no positional information were present in the embeddings, the sequence “man bites dog” would be the same as “dog bites man”; the only difference would be that rows 1 and 3 would be switched. As a concrete example:
[image: attention-weight table for “man bites dog” vs “dog bites man”]
Without positional information, the two matrices (the result of attention) would be identical in values (the only difference being that the q<1> row would be switched with the q<3> row).
With positional information, the two matrices would have different values (most probably totally different).
On the other hand, LSTMs process sequences sequentially, so the positional information is intrinsic.
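
To make this concrete in code, here is a tiny numpy sketch (random toy embeddings and weights, nothing from a real model) showing that without positional information, swapping “man” and “dog” only swaps the corresponding rows of the attention output, while with positional information the values themselves change:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_output(emb):
    Q, K, V = emb @ W_q, emb @ W_k, emb @ W_v
    weights = softmax(Q @ K.T / np.sqrt(d))    # how much each token attends to the others
    return weights @ V

man, bites, dog = rng.normal(size=(3, d))      # toy word embeddings
seq1 = np.stack([man, bites, dog])             # "man bites dog"
seq2 = np.stack([dog, bites, man])             # "dog bites man"

# Without positional information: seq2's output is seq1's output with rows 1 and 3 swapped.
out1, out2 = attention_output(seq1), attention_output(seq2)
print(np.allclose(out1, out2[[2, 1, 0]]))      # True

# With positional information added, the outputs differ in value, not just in row order.
pos = rng.normal(size=(3, d))                  # stand-in for a positional encoding
out1p, out2p = attention_output(seq1 + pos), attention_output(seq2 + pos)
print(np.allclose(out1p, out2p[[2, 1, 0]]))    # False
```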

Let me offer my intuition (which is very much in line with Andrej Karpathy’s here; note that he is talking about causal self-attention, the bottom attention block in the decoder, but the intuition is the same), where attention is seen as a communication mechanism:
[image: a graph of nodes connected by edges of varying thickness]
Image taken from this paper.
Here you can imagine each node as a token (word) and the lines as the attention weights - the thicker the line, the more attention.
To get the thickness of these lines:

  • first, we transform the original embeddings (with injected positional information) with emb \cdot W_q and emb \cdot W_k and get the Q and K values.
  • then, we multiply Q with K to get the thickness of these lines (for example, as in the table above, token <1> would have a thick line to token <2>, and tokens <4> to <6> would have lines of somewhat even thickness between themselves).

So, the intuition in loose terms:

  • Q represents “what each node is looking for”,
  • K represents “what each node has to offer”
  • attention (Q \cdot K) - how much of each value every token (including itself) should accumulate/aggregate (the thickness of the lines).
  • V represents “what to aggregate” (what are these lines “pulling in” or aggregating).

The values themselves are also a linear transformation of the original embeddings - emb \cdot W_v. So, when we dot-multiply the attention with the values, the result is some matrix (for example, 6x2) which represents how much of the value each token took from each other token. In the previous table, token <1> would have “pulled in/ingested” 79% of its “value” from token <2>; the rest (21%) of the “value” would come from other tokens.

Usually there is one more linear transformation at the end - Concat(head_1, head_2, ..., head_n) \cdot W_o - which is hard to interpret in loose terms, but in my intuition it “coordinates” the heads’ responses before the residual connection (adding these values to the original embeddings, positional encoding included).
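
Putting the Q, K, V and W_o steps together in code, a minimal single-block numpy sketch of multi-head scaled dot-product attention might look like this (toy sizes, random weights, and my own variable names, not a reference implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(emb, W_q, W_k, W_v, W_o, n_heads):
    """emb: (T, d_model); each W_* is (d_model, d_model); W_o mixes the concatenated heads."""
    T, d_model = emb.shape
    d_head = d_model // n_heads
    # Linear projections, then split the feature dimension into heads: (n_heads, T, d_head).
    Q = (emb @ W_q).reshape(T, n_heads, d_head).transpose(1, 0, 2)
    K = (emb @ W_k).reshape(T, n_heads, d_head).transpose(1, 0, 2)
    V = (emb @ W_v).reshape(T, n_heads, d_head).transpose(1, 0, 2)
    # "Thickness of the lines": how much each token attends to every token, per head.
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head))   # (n_heads, T, T)
    heads = weights @ V                                             # (n_heads, T, d_head)
    # Concat(head_1, ..., head_n) followed by the output projection W_o.
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
T, d_model, n_heads = 6, 8, 2                    # 6 tokens, as in the example above
emb = rng.normal(size=(T, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out, weights = multi_head_attention(emb, W_q, W_k, W_v, W_o, n_heads)
print(out.shape, weights[0].sum(axis=-1))        # (6, 8); each row of attention sums to 1
```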

After that we normalize the values (which helps with the gradients and keeps these values “in check”).

Now comes the time for the “computation” - the feed-forward layer, a non-linear transformation (a single-layer or multi-layer NN; two layers in the original paper), which in loose terms could be interpreted as “thinking over” what these heads communicated to each other, in addition to the original embeddings, and deciding what to add to it. In other words, here decisions are being made :slight_smile: about “what to add to the values after normalization”.
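
To make the “computation” step concrete, here is a minimal sketch of that position-wise feed-forward sub-layer (two linear layers with a ReLU, as in the original paper) together with the residual connection and normalization mentioned above; the weights are random placeholders and the layer norm omits the learnable scale/shift:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's features; keeps the values "in check".
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Applied to every position independently: Linear -> ReLU -> Linear.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
T, d_model, d_ff = 6, 8, 32                      # d_ff is usually larger than d_model
x = rng.normal(size=(T, d_model))                # output of the attention sub-layer (toy)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

# Residual connection + normalization around the feed-forward "computation".
out = layer_norm(x + feed_forward(x, W1, b1, W2, b2))
print(out.shape)                                 # (6, 8)
```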

After that we again normalize the values and pass them on to the next block, and so on.

The attention mechanism in this particular transformer (from the Attention Is All You Need paper) is dot-product attention (I don’t want to complicate things for you, but there are more types of attention mechanisms used in different types of architectures).

It is used in three places in this architecture:

The encoder’s job - represent the inputs of the source sequence (the English sentence).
The decoder’s job - generate the outputs of the target sequence (the German sentence). The “causal” self-attention allows it to work only with the nodes that are already translated. The cross-attention lets what the encoder did with the English sentence communicate with what the causal attention did to the translation so far.
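
If it helps, here is a rough numpy sketch of the two decoder-side attention patterns just described: causal self-attention (masking out “future” tokens) and cross-attention (queries from the decoder, keys and values from the encoder output). The learned Q/K/V projections are omitted for brevity, and all names and data are toy stand-ins:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V, mask=None):
    # In a real model Q, K and V would each first go through their own learned projection.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)    # blocked positions get ~zero weight
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 8
enc_out = rng.normal(size=(5, d))   # encoder representation of the source sentence (5 tokens)
dec_x = rng.normal(size=(3, d))     # decoder states for the 3 target tokens produced so far

# 1) Causal self-attention: target token i may only look at target tokens 0..i.
causal_mask = np.tril(np.ones((3, 3), dtype=bool))
dec_self = attention(dec_x, dec_x, dec_x, mask=causal_mask)

# 2) Cross-attention: queries come from the decoder, keys and values from the encoder output.
cross = attention(dec_x, enc_out, enc_out)

print(dec_self.shape, cross.shape)  # (3, 8) (3, 8)
```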

I hope that makes sense :slight_smile:

Cheers

Hey @arvyzukai
Thank you for your answer.

Maybe I didn’t explain the point that I don’t fully understand.
For a long time I tried to understand the transformer structure by browsing blogs, and of course your answers helped a lot.
From the very beginning, I tried to understand transformers by analogy with the LSTM-based encoder-decoder structure.
The third answer in this blog (neural networks - What exactly are keys, queries, and values in attention mechanisms? - Cross Validated) was very helpful for developing my intuition. The main points of your answers and what I understood from this blog are as follows:

  1. I can compare the encoder exactly to the LSTM encoder.
  • 1.1 - In addition to the embedding, the data is processed sequentially in the LSTM. To close this gap, the transformer adds different, position-dependent data to each input. This video explains the whole thing: https://www.youtube.com/watch?v=dichIcUZfOw

  • 1.2 - The blog answer I mentioned at the beginning nicely explains which values Q, K, and V correspond to in the LSTM-based model. While thinking about this answer, I saw that it is like a simple LSTM operation in which the Q value comes not from the decoder but from the same encoder LSTM that performs the operation. In other words, if you take the Q value from the input in the transformer, this mechanism works like an encoder LSTM and produces a hidden state. So now you have a semantic equivalent of the input.

  2. This is how the first multi-head attention of the decoder and encoder layers works in the transformer architecture. In other words, they produce hidden states just like the simple LSTM layers found in the LSTM-based encoder-decoder. The differences are the positional encoding and the parallelized operations.

  3. The second multi-head attention of the decoder works like the cross-attention operation, that is, the attention layer between the LSTM-based encoder and decoder. The Q value comes from the decoder hidden state, and the K and V values come from the encoder hidden state, so it is cross-attention and the result is output. I think we can see this by looking at the same graph.

“Here, the query is from the decoder hidden state, the key and value are from the encoder hidden states (key and value are the same in this figure). The score is the compatibility between the query and key, which can be a dot product between the query and key (or other form of compatibility). The scores then go through the softmax function to yield a set of weights whose sum equals 1. Each weight multiplies its corresponding values to yield the context vector which utilizes all the input hidden states.”

One of the most important things I understood was that, just like the LSTM-based decoder produces a word in each attention cycle, the transformer decoder processes the whole sentence while adding a word to its output according to the output probabilities. The transformer emulates exactly the LSTM-based encoder-decoder system.


Source: KiKaBeN - Transformer’s Encoder-Decoder

To sum up, if the Q, K, and V values come from the same layer, it produces a hidden state like the LSTM layer; this hidden state serves to semantically encode the input for the model. If the Q value comes from the decoder and the K and V values come from the encoder, a single word is created by the attention layer, just like in the LSTM-based model. And these words are appended and re-evaluated every cycle to determine the probability of the next word until the end-of-sequence token is produced.
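
That “add a word each cycle” picture can be written as a short greedy decoding loop. This is only a sketch of the control flow: `encoder`, `decoder`, `bos_id`, and `eos_id` are hypothetical stand-ins, not a real library API.

```python
import numpy as np

def greedy_translate(encoder, decoder, source_ids, bos_id, eos_id, max_len=50):
    """Autoregressive decoding sketch: the target sentence grows by one token per step."""
    enc_out = encoder(source_ids)            # run the encoder once over the whole source
    target = [bos_id]                        # start with the beginning-of-sequence token
    for _ in range(max_len):
        # The decoder attends to its own partial output (causal self-attention) and to
        # enc_out (cross-attention: Q from the decoder, K and V from the encoder output).
        probs = decoder(target, enc_out)     # next-token probabilities for each position
        next_id = int(np.argmax(probs[-1]))  # greedily take the most likely next token
        target.append(next_id)
        if next_id == eos_id:                # stop once the end-of-sequence token appears
            break
    return target

# Toy usage with stand-in models (random probabilities), just to show the control flow.
rng = np.random.default_rng(0)
dummy_encoder = lambda ids: np.zeros((len(ids), 4))
dummy_decoder = lambda tgt, enc: rng.random((len(tgt), 10))   # vocabulary of 10 tokens
print(greedy_translate(dummy_encoder, dummy_decoder, source_ids=[3, 7, 2], bos_id=0, eos_id=1))
```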

If you still think my intuition is wrong, feel free to correct me.

I hope this will be helpful for others to understand as well. :smile: