Confusion about Q, K, and V matrices

No problem, I’m happy to elaborate, @Juan_Olano!

I raised my point because I understand these two statements as different:

vs:

In other words, I find the second statement true, while the first one confusing. I wanted to point out that each word in a sequence is “communicating” (my word of choice, vs. your use of the word “querying”) with every other word (including itself).


If I understand you correctly, then yes: if we go one row (word) at a time in the Attention matrix, each row gives the percentage of value to “integrate” from each and every other token in the sequence (taken from matrix V).
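To make that row-by-row reading concrete, here is a minimal NumPy sketch (the sizes and the random values are made up purely for illustration, they don’t come from the course notebooks): the attention matrix A is (T x T), each of its rows sums to 1, and each output row is a weighted average of all the rows of V.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# toy sizes: 6 tokens, key/value dimension 4 (both invented for illustration)
T, d_k = 6, 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(T, d_k))   # one query row per token
K = rng.normal(size=(T, d_k))   # one key row per token
V = rng.normal(size=(T, d_k))   # one value row per token

# every token "communicates" with every token, so A is a full (T, T) matrix
A = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)

# each output row is a weighted average of ALL rows of V,
# weighted by the corresponding row of A
out = A @ V

print(A.shape, out.shape)   # (6, 6) (6, 4)
print(A.sum(axis=-1))       # each row sums to 1 -> the "percentages"
```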

I find this picture (thanks to Elemento) in this DLS thread very concise:

In particular, what I mean when saying “one row at a time” is that this:
image

results in a square matrix, (6 x 6) in this case, and not (1 x 6)… but, as you say, if we go line by line in a serialized manner, we can see how much “attention” each word is paying to every other word (including itself).

So the resulting matrix of this head, for example, could be:

image

Which, if we interpret it line by line as you say, means the first word would accumulate most of its value (79%) from the V of word 2, while words 4, 5 and 6 would pay roughly equal attention to each other, so their values would be close to the average of the V^{<4>}, V^{<5>} and V^{<6>} rows.
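As a toy illustration of that first row (the 0.79 weight is read off the picture above; the rest of the weights and the V matrix below are invented), the output for word 1 is just a weighted sum of all rows of V, dominated by the row of word 2:

```python
import numpy as np

# hypothetical attention weights for word 1 (numbers invented to roughly
# match the picture): 79% of the weight falls on word 2
a1 = np.array([0.07, 0.79, 0.04, 0.04, 0.03, 0.03])
assert np.isclose(a1.sum(), 1.0)    # softmax rows always sum to 1

# toy V matrix, 6 tokens x 4 dimensions (also invented)
rng = np.random.default_rng(1)
V = rng.normal(size=(6, 4))

# output for word 1 = 0.07*V[0] + 0.79*V[1] + ... ,
# so it is dominated by the contribution of the V row of word 2 (index 1)
out1 = a1 @ V
print(out1)
print(0.79 * V[1])   # the single largest term in that sum
```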

Even though this picture is about the Encoder part (where “Self Attention” is used), @Anthony_Wu was asking about the Decoder (and the “Cross Attention” part), where the calculations are similar but Q comes from one sentence and K and V come from another (a minor detail: the two sentences often even have different lengths, but they get padded in any case).
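For the Cross Attention case, the only real change is where Q, K and V come from, and hence the shape of the attention matrix, which is no longer square. A minimal sketch (the lengths 5 and 6 and the dimension 4 are arbitrary; in a real batch the shorter sentence would be padded and the pad positions masked):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# cross-attention: Q comes from the decoder sentence,
# K and V come from the encoder sentence, which may have a different length
T_dec, T_enc, d_k = 5, 6, 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(T_dec, d_k))   # decoder side
K = rng.normal(size=(T_enc, d_k))   # encoder side
V = rng.normal(size=(T_enc, d_k))   # encoder side

A = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)   # (T_dec, T_enc), not square
out = A @ V                                    # (T_dec, d_k)
print(A.shape, out.shape)                      # (5, 6) (5, 4)
```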

But in reality, this matrix is a “single” result for each head (and, as a side note, attention is most often concentrated on the tokens in the same position):
image


Anyways, I think your understanding is correct, but I wanted to raise these points because I found them a bit confusing in your first response.

Cheers