Hi @Anthony_Wu ,
These are very important questions in NLP. My answer will be oriented toward the Transformer architecture (from the paper Attention Is All You Need), and in particular toward its self-attention module.
Regarding Q, K, and V Matrices, in the context of the self-attention mechanism:
Q (Query): This represents the processed information of the current word. It’s a matrix that helps in the scoring process to see how relevant other words are to the current word.
K (Key): This represents the processed information of all the words in the sentence, including the current word. It’s used to compute a score that represents the relationship between different parts of the sentence.
V (Value): This represents the raw information of all words in the sentence. Once the scores between different parts of the sentence are computed, they are used to weight the Value matrix, which in turn gives an aggregated representation of the words in context.
Let's gain some intuition with a metaphor. It may be a bit far-fetched, but it helped me understand and consolidate my intuition about Q, K, and V.
Think of a Google search.
In a Google Search you enter a term to look for something. This term, in our attention mechanism, would be the “Query”.
When you enter this term, Google presents possible options that answer your question. These would be our “Keys”.
And then you pick one of Google’s suggestions and open the content. This would be the “Value”.
Obtaining the Value Vector
The Value vectors, like the Q and K vectors, are obtained by multiplying the input embeddings with a weight matrix specific to the value representation. This weight matrix is learned during training; before training starts, these weight matrices are initialized with random values.
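Here is a minimal NumPy sketch of those projections. The sizes and the random weights are purely illustrative; in a real model the weight matrices would be trainable parameters, not fixed random arrays:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8          # toy sizes, just for illustration

# Input embeddings: one row per token in the sentence
X = rng.standard_normal((seq_len, d_model))

# Weight matrices, randomly initialized as they would be before training
W_Q = rng.standard_normal((d_model, d_model))
W_K = rng.standard_normal((d_model, d_model))
W_V = rng.standard_normal((d_model, d_model))

Q = X @ W_Q   # queries
K = X @ W_K   # keys
V = X @ W_V   # values

print(Q.shape, K.shape, V.shape)  # each is (4, 8)
```

Each token's embedding is projected three times, once per role, which is why the same word can "ask" (Q), "advertise" (K), and "deliver" (V) different things.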
Multi-Headed vs. Single-Headed Attention
In single-headed attention, the Q, K, and V matrices are derived directly from the input, often through different learned weight matrices.
In multi-headed attention, the idea is extended by having multiple sets of weight matrices for Q, K, and V, resulting in multiple heads that attend to different parts of the input space. Each head might learn to pay attention to different aspects or relationships in the data.
One important detail in multi-headed attention is that the dimensions of the Q, K, V are affected by the number of heads.
Clarification about Q and Embeddings
In your question there's some confusion about Q containing the embeddings for the to-be-translated words and K containing the embeddings for the translated ones. That doesn't align with how self-attention is typically described.
- The Query (Q) is often derived from the word for which you want to calculate the attention score.
- The Key (K) and Value (V) are derived from all the words in the context, including the current word itself.
- This mechanism doesn’t relate to translation or translated words, but rather to weighing the importance of other words in relation to the current word in the sentence.
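A tiny NumPy sketch of that distinction (all sizes, token index, and weight values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.standard_normal((seq_len, d_model))   # embeddings for the sentence
W_Q = rng.standard_normal((d_model, d_model))
W_K = rng.standard_normal((d_model, d_model))

q = X[2] @ W_Q       # Query for one current word (token index 2)
K = X @ W_K          # Keys for every word, including token 2 itself
scores = q @ K.T     # one relevance score per word in the sentence

print(scores.shape)  # (4,): the current word is scored against all words
```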
Process in Both Single and Multi-Headed Attention
Calculate Q, K, V: These are obtained by multiplying the input embeddings with learned weight matrices specific to Queries, Keys, and Values.
Calculate Scores: Compute the dot product of Q and K, followed by scaling and applying a softmax function. Here I also have a "trick" for intuition: in a way, this dot product is similar to a cosine similarity, and I see the result as a mask that will be applied to the Values in the next step, focusing attention on certain aspects of the input.
Compute Weighted Sum: Multiply the scores with the V matrix to get a weighted representation.
(In Multi-Headed Attention) Concatenate Heads: If using multi-headed attention, repeat the above steps for each head and concatenate the results. Remember what I said about the number of heads affecting the dimensions? Thanks to that, the concatenation ends up with the same size as the input to the self-attention mechanism.
Final Linear Layer: Pass through a final learned linear layer.
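Putting the steps above together, here is a minimal single-head sketch in NumPy (toy sizes, random weights, no final linear layer):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V         # step 1: project Q, K, V
    d_k = Q.shape[-1]
    scores = softmax(Q @ K.T / np.sqrt(d_k))    # step 2: score, scale, softmax
    return scores @ V                           # step 3: weighted sum of Values

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.standard_normal((seq_len, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_model)) for _ in range(3))

out = self_attention(X, W_Q, W_K, W_V)
print(out.shape)  # (4, 8): one contextualized vector per token
```

Multi-headed attention would run this same function once per head on smaller projections, concatenate the outputs, and then apply the final learned linear layer.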
Both single-headed and multi-headed attention mechanisms utilize this process. The multi-headed approach simply extends the single-headed mechanism by repeating it across different ‘heads’ or subspaces of the input.
Understanding the above is very important for understanding transformers. It took me a lot of programming transformers for different tasks, and a lot of reading the same information explained by different authors, to get there. I'm still working on perfecting my own understanding. I hope my explanation helps you get closer to that understanding.