Why should Q, K and V be the same in self-attention?
Hi @shaya_kahn
Do you mean they should be the same in terms of their actual numerical values?
As you know, in the self-attention mechanism Q, K and V represent: the query (Q), the current position's word vector; the key (K), the kind of information available, consisting of all position-word vectors; and the value (V), that information itself, consisting of another copy of all position-word vectors.
In self-attention, an input x (represented as a vector) is turned into a vector z via three representations of that input: q (queries), k (keys) and v (values). This lets each input position in a sequence relate to itself as well as to all other inputs, so that when the encoder and decoder are used to translate a word from English into a different language, its vector representation carries the same positional significance relative to all other inputs in the target language. This mitigates the long-range dependency problem in language translation and helps translate any word or sequence of words more precisely.
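If it helps, here is a minimal NumPy sketch of that x-to-z computation (the dimensions and random weights are just made-up placeholders for illustration, not values from the course):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q = X @ W_q          # queries: what each position is looking for
    K = X @ W_k          # keys: what each position offers
    V = X @ W_v          # values: the content that gets mixed together
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # similarity of every position to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over positions
    return weights @ V   # Z: each position becomes a weighted mix of all values

# toy example: 4 positions, model dimension 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
Z = self_attention(X, W_q, W_k, W_v)
print(Z.shape)  # (4, 8): one output vector z per input position x
```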
Regards
DP
I’m not quite sure they teach us how to train this here, but they go into much more detail on it in the NLP specialization. At first this really confused me too, but I think you will finally ‘get it’ once you see how they jump from the encoder to the decoder with Q, K and V; it is a bit like SQL, going from the reference to the translation.
The query will be the ‘word we’re looking for’, the key will be the translated value, and then V will be your word vector.
It is like a big Python dictionary, even though that confused me for a long time too.
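To stretch that dictionary analogy a little, here is a rough sketch (not from the course, the numbers are made up): a Python dict does a hard, exact-match lookup, while attention does a soft lookup where the query is compared against every key and the result is a blend of all the values:

```python
import numpy as np

# Hard lookup: the query must match one key exactly, and you get exactly one value back.
table = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
print(table["cat"])

# Soft lookup (attention): the query is scored against *all* keys, and the result
# is a weighted blend of all values, weighted by query-key similarity.
keys   = np.array([[1.0, 0.0], [0.0, 1.0]])   # one row per "entry"
values = np.array([[1.0, 0.0], [0.0, 1.0]])
query  = np.array([0.9, 0.1])                 # mostly "cat", a little "dog"

scores  = keys @ query
weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the keys
print(weights @ values)                           # blend of both values
```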
The query (Q), key (K), and value (V) vectors are derived from the same input word embeddings but transformed by different learned weight matrices. While they are not necessarily “the same” in terms of their roles (the query focuses on finding relevant keys, the keys represent different parts of the sequence, and the values hold the content), they are similar in that they are all derived from the same input embeddings to ensure that the self-attention mechanism can meaningfully and efficiently compute contextual relationships between words in a sentence.
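A short PyTorch-style sketch of that point (the layer sizes are arbitrary, chosen only for illustration): the same embeddings x go through three different learned linear layers, so Q, K and V start from the same input but end up as different projections:

```python
import torch
import torch.nn as nn

d_model = 16
x = torch.randn(5, d_model)        # 5 token embeddings; all three views start from this

to_q = nn.Linear(d_model, d_model, bias=False)   # learned W_Q
to_k = nn.Linear(d_model, d_model, bias=False)   # learned W_K
to_v = nn.Linear(d_model, d_model, bias=False)   # learned W_V

Q, K, V = to_q(x), to_k(x), to_v(x)   # same source, three different learned projections
print(torch.allclose(Q, K))           # False: same input, different roles and weights
```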
Thanks, but I don’t yet understand why we set q = k = v = x in the encoder’s multi-head attention?
We are not literally assigning q = k = v = x in the encoder.
Rather, we build the Q, K and V matrices from the same input x through separate learned projections, so x’s position is determined by how it relates, both in content and in position, to all the other inputs (the other x’s). The decoder, and the rest of the transformer model, then uses matrices built in the same way for a given input.
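In code this is why you often see the same tensor passed three times into a self-attention layer; for example, with PyTorch’s nn.MultiheadAttention (the shapes below are just illustrative), q = k = v = x only means that the three internal projections all read the same input:

```python
import torch
import torch.nn as nn

d_model, n_heads = 16, 4
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

x = torch.randn(1, 5, d_model)      # (batch, seq_len, d_model)

# Encoder self-attention: query, key and value are all the same x.
# Internally the layer still applies three different learned projections to it.
out, weights = attn(x, x, x)
print(out.shape, weights.shape)     # torch.Size([1, 5, 16]) torch.Size([1, 5, 5])
```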
Someone better than I am can speak to this, but I think what we are dealing with is a bit of a misnomer. It is not that Q = K = V as a literal equation; the notation is describing the relationship between the objects.
I mean, obviously there is math involved in how we calculate these things. But q = k = v = x is really just a literal statement that all three are built from the same input. Again, see the SQL analogy.
Ok, I understand now, thank you. The notation just confused me.