In Self-Attention, the equation for calculating the attention output of a word goes:

z_t = Σ_i softmax_i((q_t · k_i) / √d_k) · v_i
My question is: doesn't the vector computed this way lose all information about which words it's actually weighting? The next block only receives the output of this one, so how would it be able to differentiate between, for example, two extreme scenarios: one where the attention falls entirely on the word x_t and the resulting output is X, and one where the attention falls entirely on the word x_{t+1}, with the resulting output also being X? In both cases, the summation produces the same attention value. (Having exactly the same values may be unlikely, but my point is that the output being X gives no information about which word it came from, once the summation is done.)
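To make the scenario concrete, here is a small sketch (with hypothetical numbers, not taken from any real model) of two one-hot attention distributions over value vectors. When two words happen to have the same value vector, attending entirely to one or entirely to the other produces identical weighted sums:

```python
import numpy as np

# Value vectors for three words; x_t and x_{t+1} happen to be identical.
values = np.array([
    [1.0, 2.0],   # value vector for x_t
    [1.0, 2.0],   # value vector for x_{t+1}
    [5.0, -1.0],  # value vector for some other word
])

# Scenario A: softmax weights put all attention on x_t.
weights_a = np.array([1.0, 0.0, 0.0])
# Scenario B: all attention on x_{t+1}.
weights_b = np.array([0.0, 1.0, 0.0])

# The attention output is the weighted sum over the value vectors.
out_a = weights_a @ values
out_b = weights_b @ values

print(out_a)  # [1. 2.]
print(out_b)  # [1. 2.] -- identical, despite attending to a different word
```

The outputs are indistinguishable to the next block, which is exactly the situation the question describes.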