C5W4 Quiz: Self-attention

One of the quiz questions asks about the concept of Self-Attention. I couldn’t find an unambiguously correct one among the 4 provided answers, so I chose the closest one: “its neighbouring words are used to compute its context by taking the average of those word values…”. But apparently this is not regarded as correct. In the lecture, the attention A is expressed as \sum_i \frac{\exp(q \cdot k^{\langle i \rangle})}{\sum_j \exp(q \cdot k^{\langle j \rangle})} v^{\langle i \rangle}. The fraction term is the softmax probability for word i. If we write p^{\langle i \rangle} = \frac{\exp(q \cdot k^{\langle i \rangle})}{\sum_j \exp(q \cdot k^{\langle j \rangle})}, A becomes \sum_i p^{\langle i \rangle} v^{\langle i \rangle}. Isn’t this an average, given that \sum_i p^{\langle i \rangle} = 1?
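To make the question concrete, here is a minimal NumPy sketch of that formula for a single query; the names (q, K, V) and the toy sizes are mine, not from the quiz or the lecture:

```python
import numpy as np

# Toy sizes: one query q and n key/value vectors, all of dimension d.
n, d = 4, 8
rng = np.random.default_rng(0)
q = rng.normal(size=(d,))    # query for the word we attend from
K = rng.normal(size=(n, d))  # keys   k^<1> ... k^<n>
V = rng.normal(size=(n, d))  # values v^<1> ... v^<n>

scores = K @ q                              # q . k^<i> for each i
p = np.exp(scores) / np.exp(scores).sum()   # softmax probabilities p^<i>
A = p @ V                                   # sum_i p^<i> v^<i>

print(p.sum())   # ~1.0 -- the weights form a probability distribution
print(A.shape)   # (8,) -- a weighted combination of the value vectors
```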

Thanks!

I may not catch your point, but here is what I thought from both math and design view points.

…, A becomes \sum_{i}p^{<i>}v^{<i>}. isn’t this an average as \sum_{i}p^{<i>}=1?

If p were uniformly distributed, then \sum_{i}p^{<i>}v^{<i>} would indeed be an average of v^{<i>}. But it is not uniform: as you wrote, it is a Softmax probability distribution, not a uniform distribution.

From the design viewpoint, it should be even clearer. Here is an overview of the Self-attention portion.

Let’s focus on one word, “Jane” in this case, just like Andrew explained. (Sorry, I do not know French… so I’m using English words instead. :wink:)

The first step is to calculate the dot product of q (which is the weighted query, i.e., qW^{Q} to be exact) and k^{<i>} (which is likewise the weighted key, k^{<i>}W^{K}). Then apply masks and scale by \frac{1}{\sqrt{d_k}}, i.e., divide by \sqrt{d_k}, so that the dot products do not blow up. Now we are ready to apply Softmax.

By taking the dot product of q with each k^{<i>}, we get a vector of similarity scores. Applying Softmax then gives a probability distribution showing which words have more association with “Jane”.
Then, we take the weighted sum of the value vectors v^{<i>} (each derived from a word’s embedding plus its position encoding), using this Softmax probability distribution as the weights. In this way, the word’s representation is updated to carry more information from the words it is associated with in the sentence. This is the basis of the attention mechanism.
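Here is a rough sketch of those steps for a single word, assuming projection matrices W^Q, W^K, W^V and leaving out masking for brevity; all names and sizes are illustrative, not the assignment’s code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

# Illustrative setup: n words, each represented by its word embedding plus
# position encoding, stacked as the rows of X.
n, d_model, d_k = 5, 16, 8
rng = np.random.default_rng(1)
X = rng.normal(size=(n, d_model))
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# Focus on one word, say "Jane" at position 0, as in the post above.
q = X[0] @ W_Q   # query for "Jane"
K = X @ W_K      # keys for every word
V = X @ W_V      # values for every word

scores = (K @ q) / np.sqrt(d_k)  # dot products, scaled by 1/sqrt(d_k)
p = softmax(scores)              # which words are most associated with "Jane"
A = p @ V                        # weighted sum of value vectors: new representation for "Jane"
```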

So, from both the math and the design viewpoints, it’s not a simple average of v^{<i>}, of course.

Again, I’m not sure I caught your point, but the above is what I thought about your question.

Hi Nobu_Asai,

Thanks for the reply. But the formula still computes an average even when p^{\langle i \rangle} is not uniformly distributed.

As a simple (scalar quantity) example, let’s say that 10 people took an exam, 3 people got 70 points, 5 got 80, and 2 got 90. One way to compute the average is (70 + 70 + 70 + 80 + 80 + 80 + 80 + 80 + 90 + 90) / 10 = 79.

Noting that 3 out of 10 (p_1 = 3/10 = 0.3) got 70, 5 out of 10 (p_2 = 5/10 = 0.5) got 80, and 2 out of 10 (p_3 = 2/10 = 0.2) got 90, we can compute it equivalently as p_1 \times 70 + p_2 \times 80 + p_3 \times 90 = 79.

From this example we can tell what \sum_i p^{\langle i \rangle} v^{\langle i \rangle} = p^{\langle 1 \rangle} v^{\langle 1 \rangle} + p^{\langle 2 \rangle} v^{\langle 2 \rangle} + p^{\langle 3 \rangle} v^{\langle 3 \rangle} + \cdots is: the average computed using the softmax probabilities as the weights.
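For what it’s worth, the same equivalence can be checked numerically; the 2-D value vectors in the second half are made up just for illustration:

```python
import numpy as np

# Scalar example from above: 3 people scored 70, 5 scored 80, 2 scored 90.
scores = np.array([70] * 3 + [80] * 5 + [90] * 2)
p = np.array([0.3, 0.5, 0.2])
values = np.array([70, 80, 90])

print(scores.mean())  # 79.0 -- simple average over all 10 people
print(p @ values)     # 79.0 -- weighted average, same result

# The same idea with vectors: sum_i p^<i> v^<i> is a weighted average of the rows of V.
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])
print(p @ V)          # [0.7 0.9] -- a convex combination of the value vectors
```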

Thanks.

I think we are saying the same thing. :slight_smile:

But the difference is in how we use the term “average”.
What you wrote is a “weighted average”, not an “average”. That’s what I pointed out.
Only if the distribution is uniform does the “weighted average” become equal to the “average”.
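To spell that out: if p^{<i>} = \frac{1}{n} for every i, then \sum_{i} p^{<i>} v^{<i>} = \frac{1}{n}\sum_{i} v^{<i>}, which is the simple average; for a non-uniform Softmax distribution the two generally differ.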

Hope this clarifies.

Hi Nobu,

Thanks again for the answer.

As I said, I couldn’t find an unambiguously correct answer. The one that mentions “the” average doesn’t say exactly what kind of averaging it is referring to. There are many ways of defining an average; a weighted average with the softmax probabilities is one of them. From the grading result, I can only guess that the average here must refer to the simple average of the value vectors over the number of words: \frac{1}{n} \sum_i v^{\langle i \rangle}. But that was not stated clearly.

I think that this would still be the best answer, given the set of answers provided. When it simply says to sum up the word values, most people would think \sum_i v^{\langle i \rangle}. Don’t you agree? Given these, plus the ones about the highest/lowest word values, I think the “average” answer would be the best one.

Thanks.

When it simply says to sum up the word values, most people would think \sum_{i}v^{<i>}. Don’t you agree?

Now I understand why you went the wrong way. I (and, I think, everyone who watched Andrew’s video) would disagree. :slightly_smiling_face:

It says “word values”. Please do not mix up “word value” with the “key”/“value” pair in a dictionary. A value v is just the value corresponding to a key k.

Please revisit Andrew’s talk. He clearly said the following:

Then finally, we’re going to take these Softmax values and multiply them with v^1, which is the value for word 1, the value for word 2, and so on, and so these values correspond to that value up there. Finally, we sum it all up.

Hope this clarifies.