W4 A1 | Is there a typo in Multi-head attention slides?

The notation I am going to use is as follows:

  • In the context of self-attention (first image below):
    • A superscript in angle brackets means the i-th word in the input sequence, e.g. in q<i>, k<i>, v<i> and A<i> the superscript ‘i’ represents the i-th word of the input sequence X = {x<1>, x<2>, x<3>, …, x<Tx>}.

    • In the self-attention video, q<i> = WQ · x<i>. Similarly, k<i> and v<i> were defined as WK · x<i> and WV · x<i>, respectively.
      WQ, WK and WV are the same for all x<i> (see the small sketch after this list).

    • Can someone confirm?
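To show what I mean, here is a minimal NumPy sketch of my understanding (the dimensions are made-up toy values, not the ones from the lecture):

```python
import numpy as np

d_x, d_k, Tx = 8, 4, 5            # toy sizes: embedding dim, query/key/value dim, sequence length
rng = np.random.default_rng(0)

X = rng.normal(size=(Tx, d_x))    # input sequence; row i is x<i>

# One shared set of weights for the whole sequence
# (row-vector convention, i.e. the transpose of q<i> = WQ · x<i>)
W_Q = rng.normal(size=(d_x, d_k))
W_K = rng.normal(size=(d_x, d_k))
W_V = rng.normal(size=(d_x, d_k))

Q = X @ W_Q                       # row i is q<i>
K = X @ W_K                       # row i is k<i>
V = X @ W_V                       # row i is v<i>
```

If that is right, WQ, WK and WV never change from word to word; only x<i> does.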

Now, moving on to multi-head attention, I am going to use the subscript ‘h’ for each head. Translating it to q, k, v: if we have 64 heads, the Q matrix would have queries {q1, q2, q3, …, qh, …, q64}. Similarly, K = {k1, k2, k3, …, kh, …, k64} and V = {v1, v2, v3, …, vh, …, v64}.

I understand that for each question qh we have a weight matrix WQh, i.e., for q1 we have WQ1, and so on.

But on the slide below, I am unable to understand WQ1 · q<1> (highlighted in blue circles below).

Does it mean we take the q<1> that was already computed using WQ · x<i> in the self-attention step and then multiply it by a new matrix WQ1? Or is it a typo? That is, in the blue circles above, instead of q<1> shouldn’t it be x<1>?

Later, when Andrew says that up to this step it is the normal self-attention that you saw previously, it adds to the confusion, as the equations in the blue boxes don’t line up with the equations highlighted in teal above.


OK, I suppose you have already caught some key points. Let’s start in reverse order.
As the notations are slightly complex, I tried to write down a whole picture of the Transformer Encoder using the same parameter names as what we learn in the Jupyter notebook. (Note that I used the original paper’s definitions for Q/K/V and their weights, so the order of Q and W is different from Andrew’s chart. But it is just a matter of transposition; do not worry about that portion.)

You can see the multi-head attention function in the center of this figure. As it may be slightly small, I will paste another image later that focuses on the multi-head attention only.

In the multi-head attention layer there are three steps, as below (a small code sketch follows the list).

  1. Linear operation (dot product of the inputs and the weights), then dispatch the queries, keys and values to the appropriate “heads”.
  2. (In each head) scaled dot-product attention to calculate attention scores (with Softmax). This is a parallel operation that distributes work across multiple heads working separately. (A big difference from an RNN.)
  3. Concatenate the outputs from all heads, and calculate the “dot product” of this concatenated output and W^O, which is another weight matrix for the concatenated output.
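Here is a minimal NumPy sketch of those three steps, just to show where each weight is applied. The variable names and sizes are my own choices, not the ones from the notebook:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    """X: (T, d_model); W_Q, W_K, W_V, W_O: (d_model, d_model)."""
    T, d_model = X.shape
    d_head = d_model // n_heads

    # Step 1: linear projections, then dispatch to the heads
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    split = lambda M: M.reshape(T, n_heads, d_head).transpose(1, 0, 2)   # (heads, T, d_head)
    Qh, Kh, Vh = split(Q), split(K), split(V)

    # Step 2: scaled dot-product attention inside each head (all heads in parallel)
    scores = softmax(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head))       # (heads, T, T)
    heads = scores @ Vh                                                  # (heads, T, d_head)

    # Step 3: concatenate the heads and apply the output weight W^O
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)                # (T, d_model)
    return concat @ W_O
```

Calling it with X of shape (T, d_model) and square weight matrices returns an updated representation of the same shape, which is what then goes through the fully connected layer.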

Then, going through a fully connected layer, we get an updated X here. This then goes into the encoder layer (multi-head attention layer) again.
The key point is that, for “self-attention”, X is used for Q, K and V. Yes, the inputs are the same. In this sense, q^{<1>} is the same as the first word vector (plus positional encoding) in X.

Then we split Q\cdot W^Q into small per-head queries (and the same for the keys and values).
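For reference, this is how the original paper writes that per-head weighting (using h for the head index, as in this thread, and with Q = K = V = X for self-attention):

\mathrm{head}_h = \mathrm{Attention}(Q W_h^Q,\ K W_h^K,\ V W_h^V), \qquad \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\, W^O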

So, the weights W_1^Q, W_2^Q, … are not applied to q^{<1>}, q^{<2>}, … yet. That is an operation inside “multi-head attention”.
In this sense, Andrew’s chart for multi-head attention is correct. (Of course, assuming that my chart is correct… 🙂)

Then, the next discussion is about self-attention. Apparently, q^{<3>}, k^{<3>}, … are already “weighted” there. In this sense, as you point out, this may not be consistent with “multi-head attention”.

My interpretation is that this is part of the “Self-Attention Intuition”, to explain how queries, keys and values work together (excluding the weights, which need another discussion).

In short, I think you understand correctly, and I also understand your points. Please consider that the chart and explanation for “self-attention” are for intuition.


Thank you for all this. Would you mind summarizing the bottom line of this answer? That is, is it a typo, or is there a self-attention computation happening as a first step/layer (which then provides the q’s, k’s and v’s that the W’s are multiplied with in the multi-head step)? Or is there another explanation for where the q’s, k’s and v’s in multi-head attention come from?

I suppose Andrew might have intentionally removed the weights, since they are not a major part of his “intuition” talk. I do not know his intention, though.

But mathematically, it is wrong. Not a typo, but wrong, I think.
If you go through my explanation, you should see that. 🙂


Thank you. After going through the “Attention Is All You Need” paper, I think Andrew’s formulas are correct, but he just doesn’t explain how he goes from multiplying by x<1> to multiplying by q<1>, k<1> and v<1>. I suppose it’s because in the paper there’s a dimension-reduction step where the embeddings matrix is reduced from the embedding size to the query size, as you call it.

So I think the multi-head attention slide is correct, but he skips a step (maybe on purpose like you said) in the explanation, and the notation is very confusing as a result.
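For what it is worth, the concrete numbers in the paper make that dimension-reduction step explicit; here is a tiny sanity check (the values are the paper’s defaults):

```python
d_model, h = 512, 8          # embedding size and number of heads in "Attention Is All You Need"
d_k = d_v = d_model // h     # each head projects down to d_model / h dimensions
assert (d_k, d_v) == (64, 64)
assert h * d_v == d_model    # concatenating the heads restores the original size
```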


“So I think the multi-head attention slide is correct, but he skips a step (maybe on purpose like you said) in the explanation, and the notation is very confusing as a result.”

That’s right. And that is the reason why I started by explaining multi-head attention first. Self-attention stops being confusing once we understand the whole mechanism.

I had the same question, and so did other people before us on this forum.

I think that the confusion comes from the fact that, coming from the self-attention video, we think of the vectors q<i> = WQ * x<i> as “queries”, as if each were asking one specific question (What? When? …). While in fact, from what I read here and on other threads, I understand that the q<i> vectors are merely thought of as “query-like encodings”, and then each W1Q * q<i>, W2Q * q<i>, etc. can be viewed as one specific question (the same question for every i, given a WjQ matrix).

Can anyone validate this interpretation, or point out where it would be wrong?

Thanks!

I think the confusion comes from Andrew’s video. Those are “intuitions”, meant to make the concept easy to understand, and they are great content for lowering the hurdle to deep learning.
Once you have successfully grasped the concepts, I recommend going through the original paper to understand its architecture and implementation.

The paper says that,

  • Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.
  • An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.

The above clearly describes what the Transformer is doing, i.e., creating relationships among words in a single sentence (self-attention), and creating relationships between words in a source sentence and a target sentence (the attention in the Decoder).
For the “intuitions”, Andrew used “question”, but it means “relation”.

Here are the basic definitions.

  • q^{<i>} is the i-th word (as a vector) in a sentence, and consists of a “word vector” plus “positional encoding”.
  • q^{<i>}W_h^Q is the weighted vector for multi-head-attention head h, derived from that same “word vector” plus “positional encoding”. It still represents a single word.

The same goes for k and v in self-attention. (In the case of the “attention” in the Decoder, which translates English to French, q comes from the self-attention in the decoder, and k and v come from the Encoder.)

What we want to calculate is which k^{<j>}W_h^K has a strong relationship with a given q^{<i>}W_h^Q. So it can be described as a “question” (a single word), as you wrote, but it is really a “relationship”: a probability distribution over the keys showing how strongly each word is related to q^{<i>}W_h^Q.
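Concretely, that probability distribution is the softmax term in the paper’s scaled dot-product attention:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V

Row i of the softmax output is the distribution over all keys for q^{<i>}W_h^Q, and multiplying by V takes the corresponding weighted sum of the values.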

Then, back to your question: the difference between q^{<i>} and q^{<i>}W_h^Q is whether the weights have been applied or not (and whether it has been split per head). Both represent one word in a sentence. It may be viewed as a question (again, a single word) for the keys, but its real role is to define the probability distribution over the k^{<j>}W_h^K associated with q^{<i>}W_h^Q.

As the next step, my recommendation is to go through the original paper.
The “intuitions” are just for grasping an overview. Even for Andrew, it is difficult to go into detail in the limited time.

If you have further questions after reading the paper, I’m happy to discuss them with you in this forum.

Nice - I haven’t read the paper yet, but based on the discussion above and your comment about ‘dimension reduction’, I think the encoder essentially reduces the dimensionality of the input vector from (say) an N-dimensional space to an H-dimensional space (H = number of heads), where each of the H dimensions is like a principal component of that N-dimensional space, trying to capture one or a combination of latent features from it. This also makes sense as to why only the X matrix is needed.

Based on your graph, it seems that the intuition in the “Self-attention” lecture is correct. The “Multi-head attention” lecture is a bit misleading because actually q^{<1>}W^{Q}_1 = x^{<1>}W^{Q}_1 (since q^{<1>} = x^{<1>} in self-attention). From my understanding, W_h^Q, W_h^K and W_h^V are weights related to a class of similar questions (like questions about actions), and q^{<1>}W^{Q}_h is a more specific question (like actions related to Paris).