No problem, I’m happy to elaborate, @Juan_Olano!
I raised my point because I understand these two statements as different:
vs:
In other words, I find the second statement true, while the first one is confusing. I wanted to point out that each word in a sequence is “communicating” (my preferred word; you used “querying”) with every other word (including itself).
If I understand you correctly, then yes: if we go one row (word) at a time in the Attention matrix, we find what percentage of the value of each and every other token in the sequence (from matrix V) to “integrate”.
I find this picture (thanks to Elemento) in this DLS thread very concise:
In particular, what I mean when saying “one row at a time” is that this:
results in a square matrix (6 x 6) in this case, not (1 x 6)… but, as you say, if we go line by line in a serialized manner, we can see how much “attention” each word is paying to every other word (including itself).
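For reference, the expression I mean here is the standard scaled dot-product attention (writing it out in case the embedded formula is not visible):

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

and it is the $\mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right)$ part that has shape (6 x 6) for a 6-token sequence.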
So the resulting matrix of this head, for example, could be:
Which, if we interpret it, as you say, in a serialized fashion, means the first word would accumulate most of its value (79%) from the V of word 2, while words 4, 5 and 6 would pay roughly equal attention to each other, so their values would be approximately the average of the V^{<4>}, V^{<5>} and V^{<6>} rows.
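To make the “one row at a time” reading concrete, here is a small NumPy sketch. The numbers in `A` below are made up (they only mimic the pattern I described: ~79% on word 2 in row 1, roughly uniform weights among words 4–6), and the value dimension of 4 is arbitrary; the point is just that multiplying by V takes, per row, a weighted average of the V rows.

```python
import numpy as np

np.random.seed(0)

T, d_v = 6, 4                      # 6 tokens, value dimension 4 (arbitrary choice)
V = np.random.randn(T, d_v)        # one V row per token: V^{<1>} ... V^{<6>}

# A toy 6 x 6 attention matrix: each row sums to 1 (softmax over the keys).
A = np.array([
    [0.05, 0.79, 0.04, 0.04, 0.04, 0.04],   # word 1: ~79% of its value from V^{<2>}
    [0.20, 0.20, 0.15, 0.15, 0.15, 0.15],
    [0.10, 0.10, 0.60, 0.10, 0.05, 0.05],
    [0.02, 0.02, 0.02, 0.31, 0.31, 0.32],   # words 4-6: roughly equal attention
    [0.02, 0.02, 0.02, 0.32, 0.31, 0.31],   #   to each other, so their outputs are
    [0.02, 0.02, 0.02, 0.31, 0.32, 0.31],   #   close to the average of V^{<4..6>}
])

out = A @ V                        # (6 x 6) @ (6 x 4) -> (6 x 4)

# "One row at a time": row i of `out` is the weighted average of the V rows
# using the weights in row i of A.
print(np.allclose(out[0], A[0] @ V))   # True
print(out[3])                          # close to the average of V^{<4>}, V^{<5>}, V^{<6>}
print(V[3:].mean(axis=0))
```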
Even though this picture is about the Encoder part (where “Self Attention” is used), @Anthony_Wu was asking about the Decoder (and the “Cross Attention” part), where the calculations are similar but Q comes from one sentence while K and V come from another (a minor detail: the two sequences often even have different lengths, but they must be padded in any case).
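Here is a minimal sketch of that cross-attention case (the lengths 5 and 7 and the dimensions are made-up examples, not from the course): because Q comes from one sequence and K, V from the other, the attention matrix is no longer square.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard scaled dot-product attention; the shapes are the only point here."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (T_q, T_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over the keys
    return weights @ V, weights                            # (T_q, d_v), (T_q, T_k)

rng = np.random.default_rng(0)
d_k, d_v = 8, 8
T_dec, T_enc = 5, 7                 # assumed lengths: decoder sentence vs. encoder sentence

Q = rng.normal(size=(T_dec, d_k))   # Q comes from the decoder side
K = rng.normal(size=(T_enc, d_k))   # K and V come from the encoder side
V = rng.normal(size=(T_enc, d_v))

out, weights = scaled_dot_product_attention(Q, K, V)
print(weights.shape)   # (5, 7): one row per decoder token, one column per encoder token
print(out.shape)       # (5, 8): each decoder token gets a weighted mix of encoder V rows
```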
But in reality, this matrix is a “single” result for each head (and as a side note, most often the attention is concentrated on the tokens in the same position):
Anyway, I think your understanding is correct, but I wanted to raise these points because I found them a bit confusing in your first response.
Cheers