Why should Q, K and V be the same in self-attention?
Hi @shaya_kahn
Do you mean they should be the same in terms of their actual numerical values?
As you know, in the self-attention mechanism Q, K and V represent: the query (Q), the current position's word vector; the key (K), the kind of information available, consisting of all position-word vectors; and the value (V), that information itself, consisting of another copy of all position-word vectors.
In self-attention, an input x (represented as a vector) is turned into a vector z via three representations of that input: q (queries), k (keys) and v (values). This lets each input position in a sequence relate to itself as well as to all other inputs, so that when the encoder and decoder are used to translate a word from English into a different language, its vector representation carries the same positional significance relative to all other inputs in the target language. This mitigates the long-range dependency problem in language translation and helps translate any word or sequence of words more precisely.
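If it helps, here is a minimal NumPy sketch of that x-to-z computation (the dimensions and random weights are just made-up placeholders for illustration, not values from the course):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q = X @ W_q          # queries: what each position is looking for
    K = X @ W_k          # keys: what each position offers
    V = X @ W_v          # values: the content that gets mixed together
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # similarity of every position to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over positions
    return weights @ V   # Z: each position becomes a weighted mix of all values

# toy example: 4 positions, model dimension 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
Z = self_attention(X, W_q, W_k, W_v)
print(Z.shape)  # (4, 8): one output vector z per input position x
```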
Regards
DP
I’m not quite sure they teach us how to train this here, but they go into much more detail on it in the NLP specialization. At first this really confused me too, but I think you will finally ‘get it’ once you see how they jump from the encoder to the decoder with Q, K and V; it is a bit like SQL, going from the reference to the translation.
The query will be the ‘word we’re looking for’, the key will be the translated value, and then V will be your word vector.
It is like a big Python dictionary, even though that confused me for a long time too.
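To stretch that dictionary analogy a little, here is a rough sketch (not from the course, the numbers are made up): a Python dict does a hard, exact-match lookup, while attention does a soft lookup where the query is compared against every key and the result is a blend of all the values:

```python
import numpy as np

# Hard lookup: the query must match one key exactly, and you get exactly one value back.
table = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
print(table["cat"])

# Soft lookup (attention): the query is scored against *all* keys, and the result
# is a weighted blend of all values, weighted by query-key similarity.
keys   = np.array([[1.0, 0.0], [0.0, 1.0]])   # one row per "entry"
values = np.array([[1.0, 0.0], [0.0, 1.0]])
query  = np.array([0.9, 0.1])                 # mostly "cat", a little "dog"

scores  = keys @ query
weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the keys
print(weights @ values)                           # blend of both values
```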
The query (Q), key (K), and value (V) vectors are derived from the same input word embeddings but transformed by different learned weight matrices. While they are not necessarily “the same” in terms of their roles (the query focuses on finding relevant keys, the keys represent different parts of the sequence, and the values hold the content), they are similar in that they are all derived from the same input embeddings to ensure that the self-attention mechanism can meaningfully and efficiently compute contextual relationships between words in a sentence.
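A short PyTorch-style sketch of that point (the layer sizes are arbitrary, chosen only for illustration): the same embeddings x go through three different learned linear layers, so Q, K and V start from the same input but end up as different projections:

```python
import torch
import torch.nn as nn

d_model = 16
x = torch.randn(5, d_model)        # 5 token embeddings; all three views start from this

to_q = nn.Linear(d_model, d_model, bias=False)   # learned W_Q
to_k = nn.Linear(d_model, d_model, bias=False)   # learned W_K
to_v = nn.Linear(d_model, d_model, bias=False)   # learned W_V

Q, K, V = to_q(x), to_k(x), to_v(x)   # same source, three different learned projections
print(torch.allclose(Q, K))           # False: same input, different roles and weights
```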
Thanks, but I don’t yet understand why we set q = k = v = x in the encoder’s multi-head attention?
We are not literally assigning q = k = v = x in the encoder.
Rather, we build the Q, K and V matrices from the same input x through separate learned projections, so x’s position is determined by how it relates, both in content and in position, to all the other inputs (the other x’s). The decoder, and the rest of the transformer model, then uses matrices built in the same way for a given input.
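In code this is why you often see the same tensor passed three times into a self-attention layer; for example, with PyTorch’s nn.MultiheadAttention (the shapes below are just illustrative), q = k = v = x only means that the three internal projections all read the same input:

```python
import torch
import torch.nn as nn

d_model, n_heads = 16, 4
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

x = torch.randn(1, 5, d_model)      # (batch, seq_len, d_model)

# Encoder self-attention: query, key and value are all the same x.
# Internally the layer still applies three different learned projections to it.
out, weights = attn(x, x, x)
print(out.shape, weights.shape)     # torch.Size([1, 5, 16]) torch.Size([1, 5, 5])
```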
Someone better than I am can speak to this, but I think what we are dealing with is a bit of a misnomer. It is not that Q = K = V as a literal equation; the notation is describing the relationship between the objects.
I mean, obviously there is math involved in how we calculate these things. But q = k = v = x is really just a literal statement that all three are built from the same input. Again, see the SQL analogy.
Ok, I understand now, thank you. The notation just confused me.