Self-Attention Summation and Information Loss

In Self-Attention, the equation for calculating the attention of a word goes:
$$A(q^{<t>}, K, V) = \sum_i \frac{\exp\left(q^{<t>} \cdot k^{<i>}\right)}{\sum_j \exp\left(q^{<t>} \cdot k^{<j>}\right)} \, v^{<i>}$$
My question is: doesn't the vector computed this way lose all information about which words it is actually weighting? The next block only receives this one's output, so how would it be able to differentiate between, for example, two extreme scenarios: one where the attention falls entirely on the word x^{<t>}, producing an output of X, and one where the attention falls entirely on the word x^{<t+1>}, which also produces an output of X? In both cases, the summation results in the same attention value. (Getting exactly the same values may be unlikely, but my point is that after summation, an output of X carries no information about which word it came from.)
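For concreteness, here is a minimal NumPy sketch (my own illustration, not from the course materials) of the two extreme scenarios above: two different attention distributions, attending to different positions, that produce an identical output vector — so a downstream block seeing only the sum cannot tell them apart.

```python
import numpy as np

def attention_output(alphas, values):
    """Weighted sum of value vectors: A = sum_i alphas[i] * values[i]."""
    return np.sum(alphas[:, None] * values, axis=0)

# Scenario 1: all attention on word 1, whose value vector is [5., 5.]
values_1 = np.array([[5., 5.],
                     [3., 3.]])
alphas_1 = np.array([1.0, 0.0])   # softmax weights concentrated on word 1

# Scenario 2: all attention on word 2, whose value vector happens to be [5., 5.]
values_2 = np.array([[3., 3.],
                     [5., 5.]])
alphas_2 = np.array([0.0, 1.0])   # softmax weights concentrated on word 2

out_1 = attention_output(alphas_1, values_1)
out_2 = attention_output(alphas_2, values_2)

# Identical outputs, even though attention fell on different words.
print(out_1, out_2)
```

This only demonstrates that the sum alone is ambiguous when the attended value vectors coincide; it does not say anything about how likely that is in a trained model.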

I recommend you review the Self-Attention lecture in Week 4, starting around 4:40. The query, key, and value matrices are all learned from the training set of x examples; each of them has its own learned weight matrix.
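As a rough sketch of what those learned matrices do (illustrative shapes and random stand-ins for trained weights, not the course code): each input embedding x^{<i>} is projected into a query, key, and value, and the attention output for one position is a softmax-weighted sum of the values.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                        # embedding / head dimension (illustrative)
X = rng.normal(size=(3, d))  # three word embeddings x^<1>..x^<3>

# Learned projection matrices (here: random stand-ins for trained weights)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Attention output for position t:
#   A(q^<t>, K, V) = sum_i softmax_i(q^<t> . k^<i>) * v^<i>
t = 0
alphas = softmax(Q[t] @ K.T / np.sqrt(d))
A_t = alphas @ V
print(alphas, A_t)
```

Note that A_t is a single d-dimensional blend of the value vectors; the weights `alphas` themselves are not passed onward, which is exactly what the question above is getting at.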

I understood that. My question was about preserving, in the final vector A, the information about which word in the sentence received the most attention. If all you have is the resulting sum over all i words (which is all that's passed to the next block, if I'm correct), then you can't tell which word was valued most, i.e., which word i had the highest q·k^{<i>} score weighting its v^{<i>}.

I don’t think that’s correct.