No problem, I’m happy to elaborate, @Juan_Olano!
I raised my point because I understand these two statements as different:
vs:
In other words, I find the second statement true, while the first one is confusing. I wanted to point out that each word in a sequence is “communicating” (my preferred word; you used “querying”) with every other word (including itself).
If I understand you correctly, then yes: if we go one row (word) at a time in the Attention matrix, we find what percentage of the value of each and every other token in the sequence (from matrix V) to “integrate”.
I find this picture (thanks to Elemento) in this DLS thread very concise:
In particular, what I mean when saying “one row at a time” is that this:
results in a square matrix (6 x 6) in this case, not (1 x 6)… but, as you say, if we go line by line in a serialized manner, we can see how much “attention” each word is paying to every other word (including itself).
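For reference, the expression I mean here is the standard scaled dot-product attention (writing it out in case the embedded formula is not visible):

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

and it is the $\mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right)$ part that has shape (6 x 6) for a 6-token sequence.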
So the resulting matrix of this head, for example, could be:
Which, if we interpret it, as you say, in a serialized fashion, means the first word would accumulate most of its value (79%) from the V of word 2, while words 4, 5 and 6 would pay roughly equal attention to each other, so their values would be approximately the average of the V^{<4>}, V^{<5>} and V^{<6>} rows.
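To make the “one row at a time” reading concrete, here is a small NumPy sketch. The numbers in `A` below are made up (they only mimic the pattern I described: ~79% on word 2 in row 1, roughly uniform weights among words 4–6), and the value dimension of 4 is arbitrary; the point is just that multiplying by V takes, per row, a weighted average of the V rows.

```python
import numpy as np

np.random.seed(0)

T, d_v = 6, 4                      # 6 tokens, value dimension 4 (arbitrary choice)
V = np.random.randn(T, d_v)        # one V row per token: V^{<1>} ... V^{<6>}

# A toy 6 x 6 attention matrix: each row sums to 1 (softmax over the keys).
A = np.array([
    [0.05, 0.79, 0.04, 0.04, 0.04, 0.04],   # word 1: ~79% of its value from V^{<2>}
    [0.20, 0.20, 0.15, 0.15, 0.15, 0.15],
    [0.10, 0.10, 0.60, 0.10, 0.05, 0.05],
    [0.02, 0.02, 0.02, 0.31, 0.31, 0.32],   # words 4-6: roughly equal attention
    [0.02, 0.02, 0.02, 0.32, 0.31, 0.31],   #   to each other, so their outputs are
    [0.02, 0.02, 0.02, 0.31, 0.32, 0.31],   #   close to the average of V^{<4..6>}
])

out = A @ V                        # (6 x 6) @ (6 x 4) -> (6 x 4)

# "One row at a time": row i of `out` is the weighted average of the V rows
# using the weights in row i of A.
print(np.allclose(out[0], A[0] @ V))   # True
print(out[3])                          # close to the average of V^{<4>}, V^{<5>}, V^{<6>}
print(V[3:].mean(axis=0))
```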
Even though this picture is about the Encoder part (where “Self Attention” is used), @Anthony_Wu was asking about the Decoder (and the “Cross Attention” part), where the calculations are similar but Q comes from one sentence while K and V come from another (a minor detail: the two sequences often even have different lengths, but they must be padded in any case).
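Here is a minimal sketch of that cross-attention case (the lengths 5 and 7 and the dimensions are made-up examples, not from the course): because Q comes from one sequence and K, V from the other, the attention matrix is no longer square.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard scaled dot-product attention; the shapes are the only point here."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (T_q, T_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over the keys
    return weights @ V, weights                            # (T_q, d_v), (T_q, T_k)

rng = np.random.default_rng(0)
d_k, d_v = 8, 8
T_dec, T_enc = 5, 7                 # assumed lengths: decoder sentence vs. encoder sentence

Q = rng.normal(size=(T_dec, d_k))   # Q comes from the decoder side
K = rng.normal(size=(T_enc, d_k))   # K and V come from the encoder side
V = rng.normal(size=(T_enc, d_v))

out, weights = scaled_dot_product_attention(Q, K, V)
print(weights.shape)   # (5, 7): one row per decoder token, one column per encoder token
print(out.shape)       # (5, 8): each decoder token gets a weighted mix of encoder V rows
```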
But in reality, this matrix is a “single” result for each head (and as a side note, most often the attention is concentrated on the tokens in the same position):
Anyway, I think your understanding is correct, but I wanted to raise these points because I found them a bit confusing in your first response.
Cheers