W4 A1 | Is there a typo in Multi-head attention slides?

The notation I am going to use is as follows:

  • In the context of self-attention (first image below):
    • A superscript in angle brackets means the i-th word in the input sequence, e.g. in q<i>, k<i>, v<i> and A<i> the superscript ‘i’ represents the i-th word of the input sequence X = {x<1>, x<2>, x<3>, …, x<Tx>}.

    • In the self-attention video, q<i> = WQ · x<i>. Similarly, k<i> and v<i> were defined as WK · x<i> and WV · x<i>, respectively.
      WQ, WK and WV are the same for all x<i> (see the small sketch after this list).

    • Can someone confirm?
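To show what I mean, here is a minimal NumPy sketch of my understanding (the dimensions are made-up toy values, not the ones from the lecture):

```python
import numpy as np

d_x, d_k, Tx = 8, 4, 5            # toy sizes: embedding dim, query/key/value dim, sequence length
rng = np.random.default_rng(0)

X = rng.normal(size=(Tx, d_x))    # input sequence; row i is x<i>

# One shared set of weights for the whole sequence
# (row-vector convention, i.e. the transpose of q<i> = WQ · x<i>)
W_Q = rng.normal(size=(d_x, d_k))
W_K = rng.normal(size=(d_x, d_k))
W_V = rng.normal(size=(d_x, d_k))

Q = X @ W_Q                       # row i is q<i>
K = X @ W_K                       # row i is k<i>
V = X @ W_V                       # row i is v<i>
```

If that is right, WQ, WK and WV never change from word to word; only x<i> does.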

Now, moving on to multi-head attention, I am going to use the subscript ‘h’ for each head. Translating it to q, k, v: if we have 64 heads, the Q matrix would have queries {q1, q2, q3, …, qh, …, q64}. Similarly, K = {k1, k2, k3, …, kh, …, k64} and V = {v1, v2, v3, …, vh, …, v64}.

I understand that for each question qh we have a weight matrix WQh, i.e., for q1 we have WQ1, and so on.

But on the slide below, I am unable to understand WQ1 · q<1> (highlighted in blue circles below).

Does it mean we take the q<1> that was already computed using WQ · x<i> in the self-attention step and then multiply it by a new matrix WQ1? Or is it a typo? That is, in the blue circles above, instead of q<1> shouldn’t it be x<1>?

Later, when Andrew says that up to this step it is the normal self-attention that you saw previously, it adds to the confusion, as the equations in the blue boxes don’t line up with the equations highlighted in teal above.


OK, I suppose you have already caught some key points. Let’s start in reverse order.
As the notations are slightly complex, I tried to write down a whole picture of the Transformer Encoder using the same parameter names as what we learn in the Jupyter notebook. (Note that I used the original paper’s definitions for Q/K/V and their weights, so the order of Q and W is different from Andrew’s chart. But it is just a matter of transposition; do not worry about that portion.)

You can see the multi-head attention function in the center of this figure. As it may be slightly small, I will paste another image later that focuses on the multi-head attention only.

In the multi-head attention layer there are three steps, as below (a small code sketch follows the list).

  1. Linear operation (dot product of the inputs and the weights), then dispatch the queries, keys and values to the appropriate “heads”.
  2. (In each head) scaled dot-product attention to calculate attention scores (with Softmax). This is a parallel operation that distributes work across multiple heads working separately. (A big difference from an RNN.)
  3. Concatenate the outputs from all heads, and calculate the “dot product” of this concatenated output and W^O, which is another weight matrix for the concatenated output.
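Here is a minimal NumPy sketch of those three steps, just to show where each weight is applied. The variable names and sizes are my own choices, not the ones from the notebook:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    """X: (T, d_model); W_Q, W_K, W_V, W_O: (d_model, d_model)."""
    T, d_model = X.shape
    d_head = d_model // n_heads

    # Step 1: linear projections, then dispatch to the heads
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    split = lambda M: M.reshape(T, n_heads, d_head).transpose(1, 0, 2)   # (heads, T, d_head)
    Qh, Kh, Vh = split(Q), split(K), split(V)

    # Step 2: scaled dot-product attention inside each head (all heads in parallel)
    scores = softmax(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head))       # (heads, T, T)
    heads = scores @ Vh                                                  # (heads, T, d_head)

    # Step 3: concatenate the heads and apply the output weight W^O
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)                # (T, d_model)
    return concat @ W_O
```

Calling it with X of shape (T, d_model) and square weight matrices returns an updated representation of the same shape, which is what then goes through the fully connected layer.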

Then, going through a fully connected layer, we get an updated X here. This then goes into the encoder layer (multi-head attention layer) again.
The key point is that, for “self-attention”, X is used for Q, K and V. Yes, the inputs are the same. In this sense, q^{<1>} is the same as the first word vector (plus positional encoding) in X.

Then we split Q\cdot W^Q into small per-head queries (and the same for the keys and values).
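For reference, this is how the original paper writes that per-head weighting (using h for the head index, as in this thread, and with Q = K = V = X for self-attention):

\mathrm{head}_h = \mathrm{Attention}(Q W_h^Q,\ K W_h^K,\ V W_h^V), \qquad \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\, W^O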

So, the weights W_1^Q, W_2^Q, … are not applied to q^{<1>}, q^{<2>}, … yet. That is an operation inside “multi-head attention”.
In this sense, Andrew’s chart for multi-head attention is correct. (Of course, assuming that my chart is correct… 🙂)

Then, the next discussion is about self-attention. Apparently, q^{<3>}, k^{<3>}, … are already “weighted” there. In this sense, as you point out, this may not be consistent with “multi-head attention”.

My interpretation is that this is part of the “Self-Attention Intuition”, to explain how queries, keys and values work together (excluding the weights, which need another discussion).

In short, I think you understand correctly, and I also understand your points. Please consider that the chart and explanation for “self-attention” are for intuition.


Thank you for all this. Would you mind summarizing the bottom line of this answer? That is, is it a typo, or is there a self-attention computation happening as a first step/layer (which then provides the q’s, k’s and v’s that the W’s are multiplied with in the multi-head step)? Or is there another explanation for where the q’s, k’s and v’s in multi-head attention come from?

I suppose Andrew might have intentionally removed the weights, since they are not a major part of his “intuition” talk. I do not know his intention, though.

But mathematically, it is wrong. Not a typo, but wrong, I think.
If you go through my explanation, you should see that. 🙂


Thank you. After going through the “Attention Is All You Need” paper, I think Andrew’s formulas are correct, but he just doesn’t explain how he goes from multiplying by x<1> to multiplying by q<1>, k<1> and v<1>. I suppose it’s because in the paper there’s a dimension-reduction step where the embeddings matrix is reduced from the embedding size to the query size, as you call it.

So I think the multi-head attention slide is correct, but he skips a step (maybe on purpose like you said) in the explanation, and the notation is very confusing as a result.
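For what it is worth, the concrete numbers in the paper make that dimension-reduction step explicit; here is a tiny sanity check (the values are the paper’s defaults):

```python
d_model, h = 512, 8          # embedding size and number of heads in "Attention Is All You Need"
d_k = d_v = d_model // h     # each head projects down to d_model / h dimensions
assert (d_k, d_v) == (64, 64)
assert h * d_v == d_model    # concatenating the heads restores the original size
```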


“So I think the multi-head attention slide is correct, but he skips a step (maybe on purpose like you said) in the explanation, and the notation is very confusing as a result.”

That’s right. And that is the reason why I started by explaining multi-head attention first. Self-attention stops being confusing once we understand the whole mechanism.

I had the same question, and so did other people before us on this forum.

I think that the confusion comes from the fact that, coming from the self-attention video, we think of the vectors q<i> = WQ * x<i> as “queries”, as if each were asking one specific question (What? When? …). While in fact, from what I read here and on other threads, I understand that the q<i> vectors are merely thought of as “query-like encodings”, and then each W1Q * q<i>, W2Q * q<i>, etc. can be viewed as one specific question (the same question for every i, given a WjQ matrix).

Can anyone validate this interpretation, or point out where it would be wrong?

Thanks!

I think the confusion comes from Andrew’s video. Those are “intuitions”, meant to make the concept easy to understand, and they are great content for lowering the hurdle to deep learning.
Once you have successfully grasped the concepts, I recommend going through the original paper to understand its architecture and implementation.

The paper says that,

  • Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.
  • An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.

The above clearly describes what the Transformer is doing, i.e., creating relationships among words in a single sentence (self-attention), and creating relationships between words in a source sentence and a target sentence (the attention in the Decoder).
For the “intuitions”, Andrew used “question”, but it means “relation”.

Here are the basic definitions.

  • q^{<i>} is the i-th word (as a vector) in a sentence, and consists of a “word vector” plus “positional encoding”.
  • q^{<i>}W_h^Q is the weighted vector for multi-head-attention head h, derived from that same “word vector” plus “positional encoding”. It still represents a single word.

The same goes for k and v in self-attention. (In the case of the “attention” in the Decoder, which translates English to French, q comes from the self-attention in the decoder, and k and v come from the Encoder.)

What we want to calculate is which k^{<j>}W_h^K has a strong relationship with a given q^{<i>}W_h^Q. So it can be described as a “question” (a single word), as you wrote, but it is really a “relationship”: a probability distribution over the keys showing how strongly each word is related to q^{<i>}W_h^Q.
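Concretely, that probability distribution is the softmax term in the paper’s scaled dot-product attention:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V

Row i of the softmax output is the distribution over all keys for q^{<i>}W_h^Q, and multiplying by V takes the corresponding weighted sum of the values.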

Then, back to your question: the difference between q^{<i>} and q^{<i>}W_h^Q is whether the weights have been applied or not (and whether it has been split per head). Both represent one word in a sentence. It may be viewed as a question (again, a single word) for the keys, but its real role is to define the probability distribution over the k^{<j>}W_h^K associated with q^{<i>}W_h^Q.

As the next step, my recommendation is to go through the original paper.
The “intuitions” are just for grasping an overview. Even for Andrew, it is difficult to go into detail in the limited time.

If you have further questions after reading the paper, I’m happy to discuss them with you in this forum.

Nice - I haven’t read the paper yet, but based on the discussion above and your comment about ‘dimension reduction’, I think the encoder essentially reduces the dimensionality of the input vector from (say) an N-dimensional space to an H-dimensional space (H = number of heads), where each of the H dimensions is like a principal component of that N-dimensional space, trying to capture one or a combination of latent features from it. This also makes sense as to why only the X matrix is needed.

Based on your graph, it seems that the intuition in the “Self-attention” lecture is correct. The “Multi-head attention” lecture is a bit misleading because actually q^{<1>}W^{Q}_1 = x^{<1>}W^{Q}_1 (since q^{<1>} = x^{<1>} in self-attention). From my understanding, W_h^Q, W_h^K and W_h^V are weights related to a class of similar questions (like questions about actions), and q^{<1>}W^{Q}_h is a more specific question (like actions related to Paris).