In Self-Attention, the equation for calculating the attention output of a word goes:

z_t = Σ_i softmax_i((q_t · k_i) / √d_k) · v_i
My question is: doesn't the vector computed this way lose all information about which words it's actually weighting? The next block only receives the output of this one, so how would it be able to differentiate between, for example, two extreme scenarios: one where the attention falls entirely on the word x_t and the resulting output is X, and one where the attention falls entirely on the word x_{t+1}, with the resulting output also being X? In both cases, the summation produces the same attention value. (Having exactly the same values may be unlikely, but my point is that the output being X gives no information about which word it came from, once the summation is done.)
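To make the scenario concrete, here is a small sketch (with hypothetical numbers, not taken from any real model) of two one-hot attention distributions over value vectors. When two words happen to have the same value vector, attending entirely to one or entirely to the other produces identical weighted sums:

```python
import numpy as np

# Value vectors for three words; x_t and x_{t+1} happen to be identical.
values = np.array([
    [1.0, 2.0],   # value vector for x_t
    [1.0, 2.0],   # value vector for x_{t+1}
    [5.0, -1.0],  # value vector for some other word
])

# Scenario A: softmax weights put all attention on x_t.
weights_a = np.array([1.0, 0.0, 0.0])
# Scenario B: all attention on x_{t+1}.
weights_b = np.array([0.0, 1.0, 0.0])

# The attention output is the weighted sum over the value vectors.
out_a = weights_a @ values
out_b = weights_b @ values

print(out_a)  # [1. 2.]
print(out_b)  # [1. 2.] -- identical, despite attending to a different word
```

The outputs are indistinguishable to the next block, which is exactly the situation the question describes.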