Learning q, k, v in self-attention and multi-head attention

In the week 4 video “self-attention” at time 4:24, Dr. Ng says “The way that q^{<3>} is computed is as a learned matrix, which I’m going to write as W^Q times x^{<3>}, and similarly for the key and value pairs.”

So q at each time step is a matrix, computed as a learned weight matrix W^Q times the encoding of the word at that time step, x^{<3>}? Wouldn’t it be a vector rather than a matrix?

In the next video, “Multi-Head Attention”, at time 1:00, Dr. Ng says “… calculate multiple self-attentions. So the first of these, you multiply the q, k, v matrices with weight matrices W_1^Q, W_1^K, W_1^V …”

He again referred to each q as a matrix, and this time we’re multiplying them by weight matrices W. So in self-attention, q^{<i>} = W^Q x^{<i>}, but now we’re using W_j^Q q^{<i>}. This is all very confusing. Can someone clarify whether q^{<i>} is a matrix or a vector, and whether the weight matrices W in the two videos are unrelated? Are these errors in the videos?

Hello @Alexander_Valarus,

Let’s look at this slide:

Those symbols with “<?>” as a superscript are vectors. However, the capital Q, K, and V are matrices. For example, Q is a matrix, and it is the result of stacking all of those q^{<?>} vectors together.

The W’s in the video “Self-Attention” do not have subscripted numbers because that video talks about only one head. The W’s in the video “Multi-Head Attention” have subscripted numbers because that video talks about multiple heads: each head has its own set of W’s, and the subscripts distinguish one head from another.

Now, back to your question:

I do not hear Andrew say that, and I think Andrew has mentioned quite a few times that the q^{<?>} are vectors.

My interpretation (of what you have quoted there) is as I explained above.

As you have quoted, he said “you multiply the q, k, v matrices with weight matrices”. Since we have already learned this equation in the previous video,

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

I think we can infer that he meant the capital Q, K, and V, which are all matrices, and they are indeed multiplied together in that formula.

@Alexander_Valarus, I can imagine that you are trying to stay with Andrew through the two videos, and because you paid close attention to what he said, you asked these questions. I think my response should focus on the meaning of the symbols (which is why I talked about them at the beginning), because once that is cleared up, we can move on. If we move on and practice with the labs, we will become more familiar with the material, and if we then come back to those videos, you will find them easier to follow.

It is common to want to watch the same videos a couple of times (I believe you have done so :wink: ), but it is even better to have some different experiences in between (such as the labs). I therefore also suggest going to the lab and checking out some shapes to verify this for yourself. This is not simple, because you need to read through all the code carefully to identify which part of the code corresponds to which part of the videos.

Cheers,
Raymond