Confusion about Q, K, and V matrices

Can someone explain clearly what exactly the Q, K, and V matrices are? How exactly do we get the value vector? I know that Q contains the embeddings for the words that we want to translate and K contains the embeddings for the translated words. So what exactly is V and how do we obtain it?

Another question: in multi-headed attention, we multiply the embeddings of the words that we want to translate by the Q, K, and V matrices, but doesn’t Q itself contain the embeddings for the to-be-translated words? Does this embedding multiplication also happen in single-headed attention?

Thanks!


Hi @Anthony_Wu ,

These are very important questions regarding NLP. My answer is oriented to the Transformer architecture (from the paper Attention Is All You Need), and in particular to the Self-Attention module of the transformer.

Regarding Q, K, and V Matrices, in the context of the self-attention mechanism:

  • Q (Query): This represents the processed information of the current word. It’s a matrix that helps in the scoring process to see how relevant other words are to the current word.

  • K (Key): This represents the processed information of all the words in the sentence, including the current word. It’s used to compute a score that represents the relationship between different parts of the sentence.

  • V (Value): This represents the content of all the words in the sentence. Like Q and K, it is a learned projection of the input embeddings rather than the raw embeddings themselves. Once the scores between different parts of the sentence are computed, they are used to weight the Value matrix, which in turn gives an aggregated representation of the words in context.

Let’s gain intuition on this with a metaphor:

This metaphor may be a bit far-fetched, but it helped me understand and consolidate my intuition on QKV.

Let’s think of a Google Search.

In a Google Search you enter a term to look for something. This term, in our attention mechanism, would be the “Query”.
When you enter this term, Google presents possible options that answer your question. These would be our “Keys”.
And then you pick one of Google’s suggestions and open the content. This would be the “Value”.

Obtaining the Value Vector

The Value vectors, like the Q and K vectors, are obtained by multiplying the input embeddings by a learned weight matrix. Each of the three has its own matrix (W_q, W_k and W_v), so the Value vectors come from the matrix specific to the value representation. These weight matrices are learned during training; before training starts, they are initialized with random values.
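As a minimal sketch of that multiplication (assuming NumPy; the sizes and the random "embeddings" are made up purely for illustration, and real models learn the weights instead of leaving them random):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
seq_len, d_model, d_k = 4, 8, 8

# Input embeddings: one row per token in the sentence.
X = rng.normal(size=(seq_len, d_model))

# Three separate weight matrices, randomly initialized
# (this is the state they would be in before training).
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# The same embeddings are projected three different ways.
Q = X @ W_q
K = X @ W_k
V = X @ W_v

print(Q.shape, K.shape, V.shape)
```

All three matrices come from the same input X; they differ only because W_q, W_k and W_v differ.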

Multi-Headed vs. Single-Headed Attention

In single-headed attention, the Q, K, and V matrices are derived directly from the input, often through different learned weight matrices.

In multi-headed attention, the idea is extended by having multiple sets of weight matrices for Q, K, and V, resulting in multiple heads that attend to different parts of the input space. Each head might learn to pay attention to different aspects or relationships in the data.

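One important detail in multi-headed attention is that the dimensions of Q, K, and V are affected by the number of heads: typically the model dimension is split evenly across the heads, so each head works with vectors of size d_model / num_heads (for example, 512 / 8 = 64 in the original paper).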

Clarification about Q and Embeddings

In your question, there’s a confusion about Q containing the embeddings for the to-be-translated words, and K containing the embeddings for the translated ones. This doesn’t align with the typical explanation of self-attention.

  • The Query (Q) is often derived from the word for which you want to calculate the attention score.
  • The Key (K) and Value (V) are derived from all the words in the context, including the current word itself.
  • This mechanism doesn’t relate to translation or translated words, but rather to weighing the importance of other words in relation to the current word in the sentence.

Process in Both Single and Multi-Headed Attention

  1. Calculate Q, K, V: These are obtained by multiplying the input embeddings with learned weight matrices specific to Queries, Keys, and Values.

  2. Calculate Scores: Compute the dot product of Q and K, followed by scaling (by the square root of the key dimension) and applying a softmax function. Here I also have a “trick” to gain intuition: in a way, this dot product is similar to cosine similarity, and I see the result as a mask that is applied to the Value in the next step and helps focus attention on some aspects of the input.

  3. Compute Weighted Sum: Multiply the scores with the V matrix to get a weighted representation.

  4. (In Multi-Headed Attention) Concatenate Heads: If using multi-headed attention, repeat the above steps for each head and concatenate the results. Remember what I said about the number of heads affecting the dimension? Well, thanks to that, the concatenation ends up with the same size as the input to the self-attention mechanism.

  5. Final Linear Layer: Pass through a final learned linear layer.

Both single-headed and multi-headed attention mechanisms utilize this process. The multi-headed approach simply extends the single-headed mechanism by repeating it across different ‘heads’ or subspaces of the input.
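The five steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the reference implementation: it assumes NumPy, random weights instead of trained ones, no masking, no batching, and one large projection per Q/K/V that is then split into heads (a common way of implementing per-head weight matrices). The names `softmax`, `multi_head_attention` and `W_o` are hypothetical helpers for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads

    # Step 1: project the input embeddings into Q, K and V.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)

    # Step 2: scaled dot product followed by softmax over the keys.
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)  # (num_heads, seq_len, seq_len)

    # Step 3: weighted sum of the value vectors.
    heads = weights @ Vh  # (num_heads, seq_len, d_head)

    # Step 4: concatenate the heads back to (seq_len, d_model).
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)

    # Step 5: final learned linear layer.
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 16, 4  # hypothetical sizes
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)      # (6, 16)
print(weights.shape)  # (4, 6, 6)
```

Calling the same function with `num_heads=1` gives the single-headed case: the steps are identical, there is just one head using the full dimension.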

Understanding the above is very important for understanding transformers. It took me a lot of programming transformers for different tasks, and a lot of reading articles that cover the same information from different authors. I am still working on perfecting my understanding. I hope my explanation helps you get closer to that understanding :)

Cheers!

Juan


Hi @Anthony_Wu

The “original” embeddings multiplied by the weight matrices W_q, W_k and W_v result in the Q, K and V matrices. Image from the course:

The important thing to understand is that Q contains transformed embeddings for the words to be translated. The same goes for the words that are translated: K and V are transformed embeddings (because W_k and W_v have different values, you get different K and V).

You might rephrase this question after reading my answers to your previous questions.

But in essence, single-head attention tries to represent all the information with one big matrix, while multi-head attention tries to represent all the information with a handful of smaller matrices; hence each head can specialize in different things but has to do so with less space.

Cheers


Hi @Juan_Olano

I think @Anthony_Wu is asking about “Cross Attention” and not “Self-Attention” because he is talking about encoder-decoder transformers architecture, and in translation typically:
Q = Target sentence (for example, German)
K = V = Source sentence (for example, English)

I would also point out that in any case (Self-Attention or Cross-Attention) Q does not come from one word, but from all of them in a sequence.

Cheers

Thanks for the note @arvyzukai !

Maybe you can shed light on my understanding of the situation:

When executing attention on a series of tokens, where each token has been converted into an embedding vector, it is my understanding that the attention mechanism will Query each word against the rest of the words, or more precisely, each embedding vector (Q) against the rest of the embedding vectors (K), to find the ‘best match’ (V). Although this may be happening in parallel for all the embeddings, if we were to serialize it to try to gain some intuition, the serialized version would be one word (or, more technically, one embedding that comes from a token) versus all the other words (embeddings).

No problem, I’m happy to elaborate @Juan_Olano !

I raised my point because I understand these two statements as different:

vs:

In other words, I find the second statement true and the first one confusing. I wanted to point out that each word in a sequence is “communicating” (my preferred word, where you use “querying”) with every other word (including itself).


If I understand you correctly, then yes, if we go one row (word) at a time in the Attention matrix we find the percentage of how much value from each and every other token in a sequence (from matrix V) to “integrate”.

I find this picture (thanks to Elemento) in this DLS thread very concise:

In particular, what I mean when saying “one row at a time” is that this:
image

results in a square matrix (6 x 6) in this case, and not (1 x 6)… but as you say, if we go line by line in a serialized manner, we can see how much “attention” each word is paying to every other word (including itself).

So the resulting matrix of this head, for example, could be:

image

If we interpret this in serialized fashion, as you say, the first word would accumulate most of its value (79%) from the V of word 2, while words 4, 5 and 6 would pay roughly equal attention to each other, so most of their values would be an average of the V^{<4>}, V^{<5>} and V^{<6>} rows.
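To make the "all at once" versus "one row at a time" views concrete, here is a small sketch (assuming NumPy and random Q, K and V for a hypothetical 6-token sequence; the numbers will not match the picture). It computes the full 6 x 6 attention matrix in one step, then repeats the computation one query row at a time, and checks that both give the same result.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
seq_len, d_k = 6, 4  # hypothetical sizes
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

# All words at once: a square (6 x 6) attention matrix.
A = softmax(Q @ K.T / np.sqrt(d_k))
out = A @ V

# Serialized view: one query row against all the keys.
rows = []
for i in range(seq_len):
    a_i = softmax(Q[i] @ K.T / np.sqrt(d_k))  # shape (6,), sums to 1
    rows.append(a_i @ V)                      # weighted mix of the V rows
out_serial = np.stack(rows)

print(A.shape)                       # (6, 6)
print(np.allclose(out, out_serial))  # True
```

Each row of A sums to 1, so it can be read as the percentages of each V row that the corresponding word "integrates".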

Even though this picture is about the Encoder part (where “Self Attention” is used), @Anthony_Wu was asking about the Decoder (and the “Cross Attention” part), where the calculations are similar but Q comes from one sentence and K and V come from another (a minor detail: the two sentences often have different lengths, but they must be padded in any case).

But in reality, this matrix is a “single” result for each head (and as a side note, most often attention is concentrated on the tokens in the same position):
image


Anyways, I think your understanding is correct, but I wanted to point out these points because I found them a bit confusing in your first response.

Cheers

Very nice and clear explanation, thank you @arvyzukai !

Just to clarify, for the example of translating from English to Russian, does Q come from the transformed embeddings of the English sentence? Do K and V come from the transformed embeddings of the “so-far” generated Russian words from the decoder? Then, using Q, K, and V, can we generate the next Russian word?

@Anthony_Wu , Thank you for your follow up question! You are very well into it now, great!

Regarding your question:

In the Encoder-Decoder architecture, and specifically in the cross-attention part of it, the Encoder is giving the KEY and VALUE to the Decoder, and, of course, the QUERY is from the Decoder.
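A rough sketch of that cross-attention arrangement (assuming NumPy, a single head, random stand-ins for the encoder output and the decoder states, and untrained weights; all names and sizes here are hypothetical):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
d_model, d_k = 8, 8
src_len, tgt_len = 5, 3  # source and target lengths may differ

# Stand-ins for the encoder output (source sentence) and the
# decoder states (target words generated so far).
encoder_out = rng.normal(size=(src_len, d_model))
decoder_state = rng.normal(size=(tgt_len, d_model))

W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = decoder_state @ W_q  # QUERY comes from the decoder: (3, 8)
K = encoder_out @ W_k    # KEY comes from the encoder:   (5, 8)
V = encoder_out @ W_v    # VALUE comes from the encoder: (5, 8)

weights = softmax(Q @ K.T / np.sqrt(d_k))  # (3, 5): target rows, source columns
context = weights @ V                      # (3, 8): one context vector per target position

print(weights.shape, context.shape)
```

Note that the attention matrix is no longer square here: it has one row per target position and one column per source position.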

If you think about the analogy with Google’s search, this kind of makes sense: the user enters the ‘query’, Google shows the ‘keys’, and then the user selects one ‘key’ to see its ‘value’. Remember that this is just a metaphor to gain intuition.

Thoughts?

Thanks!

Juan

This is by far the best explanation I’ve found: https://www.youtube.com/watch?v=eMlx5fFNoYc&list=PLRxuk-M6hcZpeqJ4nDbi6zkITVa8itZfy&index=5&t=3s