C5W4A1 Understanding Self-Attention


I’ve failed to understand how relations between words in a sentence are represented in the output of self-attention. Let’s take the following example of the input sentence: “the black cat sat on the mat.”

  1. The attention scores are straightforward. For example, for the word “black” we calculate the attention score for “black” and “the”, for “black” and “cat”, and so on. Let us now focus on the pair “black” and “cat” only.

  2. In the next step we multiply the attention score for this pair by the Value vector of the word “cat”. In somewhat simplified notation, this is (q2 · k3) v3, where q2 is the Query for “black”, and k3 and v3 are the Key and Value for “cat”. I omit scaling and softmax for simplicity.

  3. So far so good. We do the same kind of calculation for all the other pairs beginning with the word “black”: “black” and “the”, “black” and “black”, and so on.

  4. Here is where I get lost: in row 2 of the output (the row for “black”), column 1, we sum all of the results from item 3 above. (Strictly, that sum covers one dimension of the Value vectors; the second dimension goes in column 2, and so on.)
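The four steps above can be sketched in NumPy. This is only a minimal illustration, not the assignment's code: the Q, K and V matrices are random placeholders, the dimensions `d_k` and `d_v` are made up, and scaling and softmax are included even though the notation above omits them. Indices are 0-based here, so “black” is row 1 and “cat” is row 2.

```python
import numpy as np

words = ["the", "black", "cat", "sat", "on", "the", "mat"]

# Hypothetical sizes and random placeholder matrices (assumptions for illustration).
rng = np.random.default_rng(0)
d_k, d_v = 4, 3
Q = rng.normal(size=(len(words), d_k))  # one Query row per word
K = rng.normal(size=(len(words), d_k))  # one Key row per word
V = rng.normal(size=(len(words), d_v))  # one Value row per word


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


t = 1  # index of "black" (0-based)

# Step 1: attention scores of "black" against every word, then softmax.
scores = Q[t] @ K.T / np.sqrt(d_k)  # shape (7,)
weights = softmax(scores)           # shape (7,), sums to 1

# Steps 2 and 3: each weight multiplies the Value vector of its word.
terms = weights[:, None] * V        # shape (7, d_v)

# Step 4: the output row for "black" is the sum of those terms.
row_black = terms.sum(axis=0)       # shape (d_v,)

# The same row falls out of the usual matrix form, softmax(QK^T / sqrt(d_k)) V.
A = softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V
```

Here `row_black` and `A[t]` are the same vector: each column of the output row is a sum over all seven words, one column per Value dimension.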

How on earth does this weighted sum of the attention scores for the word “black” times the Values for all the words in the sentence know that it is the words “black” and “cat” that attend to each other most strongly? Isn’t the information from the attention scores lost forever in the summation? Evidently it is not, because Transformers do work, but I don’t understand why.
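One way to probe this question is a toy case, sketched below under loud assumptions: the attention weights for “black” are hand-picked rather than learned, and each word's Value vector is a hypothetical one-hot vector so that every word's contribution stays visible after the sum. Real Value vectors are dense and learned, so this shows only that the weights shape the sum, not how a trained model uses it.

```python
import numpy as np

words = ["the", "black", "cat", "sat", "on", "the", "mat"]

# Assumption: one-hot Value vectors, so each output dimension belongs to one word.
V = np.eye(len(words))

# Assumption: hand-picked softmax weights for "black", heavily favouring "cat".
weights = np.array([0.02, 0.05, 0.85, 0.02, 0.02, 0.02, 0.02])

# The output row for "black" is the weighted sum of all Value vectors.
out_black = weights @ V

# The largest component of the output points at the most attended word.
strongest = words[int(np.argmax(out_black))]
```

In this toy setup `strongest` comes out as “cat”: the sum is dominated by the Value vector that received the largest weight, so the output row for “black” mostly looks like the Value of “cat”.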

For the last few days, I’ve read multiple articles on the web, including “Attention Is All You Need”, but they don’t explain that. I guess the explanation may be too long or too mathematical for this help forum, but I would appreciate a reference to some external resources (including paid ones).

Please note: my question may be similar to the one asked here but no clear answer has been provided at that post:

The mechanisms here are pretty complicated and the first time I went through everything, it all felt like it just went by too fast. I’m still not claiming that I understand everything, but my suggestion is that you should watch the “Self Attention” lecture again with the following idea in mind: the A value in that formula is not a single value. The point is that we compute that complete formula for each word in the input sentence. For each word’s A^{<t>} value, we have contributions from its relationships with all the other words in the sentence.
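For reference, the per-word formula can be written out as below. This is a sketch of the unscaled form as I understand it from the lecture, using the same A^{<t>} notation; the index names i and j and the sentence length T_x are my own labels, and the 1/\sqrt{d_k} scaling is left out.

```latex
A^{<t>} = \sum_{i=1}^{T_x}
  \frac{\exp\left(q^{<t>} \cdot k^{<i>}\right)}
       {\sum_{j=1}^{T_x} \exp\left(q^{<t>} \cdot k^{<j>}\right)}
  \, v^{<i>}
```

Each word t gets its own vector A^{<t>}, and every other word i contributes its Value v^{<i>} in proportion to how well its Key matches the Query of word t.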

You’ve obviously thought about this pretty hard, so please let me know if you think what I’m saying above is just missing your point. :grinning:

I have just come across a fabulous explanation of why self-attention works. It is by Luis Serrano, a good friend of DeepLearning.ai. Enjoy :slight_smile: