Learning q, k, v in self-attention and multihead attention

In the week 4 video “self-attention” at time 4:24, Dr. Ng says “The way that q^{<3>} is computed is as a learned matrix, which I’m going to write as W^Q times x^{<3>}, and similarly for the key and value pairs.”

So is q at each time step a matrix, formed as a learned weight matrix W^Q times the encoding x^{<3>} of the word at that time step? Wouldn’t it be a vector rather than a matrix?

In the next video, “Multi-Head Attention” at time 1:00, Dr. Ng says “… calculate multiple self-attentions. So the first of these, you multiply the q, k, v matrices with weight matrices W_1^Q, W_1^K, W_1^V...”

He again referred to each q as a matrix, and this time we’re multiplying them by a weight matrix W. So in self-attention, q^{<i>} = W^Q x^{<i>}, but now we’re using W_j^Q q^{<i>}. This is all very confusing. Can someone clarify whether q^{<i>} is a matrix or a vector, and whether the weight matrices W in the two videos are unrelated? Are these errors in the videos?

Hello @Alexander_Valarus,

Let’s look at this slide:

Symbols with a superscript of the form <?> (for example q^{<3>}) are vectors. The capital Q, K, and V, however, are matrices. For example, Q is the matrix you get by stacking all the q^{<?>} vectors together.
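To make the vector-versus-matrix distinction concrete, here is a minimal NumPy sketch. The sizes (`T`, `d_model`, `d_k`), the random values, and the choice to stack the query vectors as rows are my own assumptions for illustration; they are not taken from the lecture or the lab.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 5 words, 8-dim word encodings, 4-dim queries.
T, d_model, d_k = 5, 8, 4

X = rng.normal(size=(T, d_model))      # row t holds the encoding x^{<t+1>}
W_Q = rng.normal(size=(d_k, d_model))  # stands in for the learned W^Q (random here)

# One query per word: q^{<t>} = W^Q x^{<t>} is a vector of length d_k.
q_vectors = [W_Q @ X[t] for t in range(T)]

# Stacking the query vectors (as rows, by assumption) gives the matrix Q.
Q = np.stack(q_vectors)

# The same matrix comes from a single product over all words at once.
Q_batched = X @ W_Q.T

print(q_vectors[2].shape)  # shape of q^{<3>}: a 1-D vector
print(Q.shape)             # shape of Q: a 2-D matrix
```

So each q^{<?>} is a vector, and Q is just all of them collected into one array.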

The W’s in the video “Self-Attention” do not have a subscripted number because that video covers only one head. The W’s in the video “Multi-Head Attention” have subscripted numbers because that video covers multiple heads: each head has its own set of W’s, and the subscripts distinguish one head from another.
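Here is a rough sketch of what "one set of W’s per head" looks like in code. The number of heads, the sizes, and the dictionary layout are hypothetical, and applying every head’s weights directly to the same input `X` is a simplifying assumption of mine rather than the lab’s actual implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: 5 words, 8-dim encodings, 4-dim queries, 3 heads.
T, d_model, d_k, n_heads = 5, 8, 4, 3
X = rng.normal(size=(T, d_model))

# One independent set (W_i^Q, W_i^K, W_i^V) per head; random stand-ins here.
heads = [
    {name: rng.normal(size=(d_k, d_model)) for name in ("W_Q", "W_K", "W_V")}
    for _ in range(n_heads)
]

# Each head projects the same input with its own weights, so the Q_i differ
# in value from head to head but share the same shape.
Q_per_head = [X @ h["W_Q"].T for h in heads]

for i, Q_i in enumerate(Q_per_head, start=1):
    print(f"head {i}: Q_{i} shape = {Q_i.shape}")
```

The subscript in W_1^Q, W_2^Q, ... corresponds to the index into `heads` in this sketch.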

Now, back to your question:

I did not hear Andrew say this, and I think he has mentioned quite a few times that the q^{<?>} are vectors.

My interpretation (of what you have quoted there) is as I explained above.

As you have quoted, he said “you multiply the q, k, v matrices with weight matrices”. Since we have already learned this equation in the last video,

Attention(Q, K, V) = softmax(Q K^T / \sqrt{d_k}) V

I think we can infer that he meant the capital Q, K, and V, which are all matrices, and they are indeed multiplied together in some way.
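If it helps, this is a small sketch of how the matrices Q, K, and V get multiplied together in scaled dot-product attention, softmax(Q K^T / \sqrt{d_k}) V. The helper names `softmax` and `attention`, the sizes, and the random inputs are assumptions for illustration only, not the lab’s code.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention on matrices whose rows are per-word vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (T, T): every query against every key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(2)
T, d_k, d_v = 5, 4, 6                   # hypothetical sizes
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))

A, weights = attention(Q, K, V)
print(weights.shape)  # one row of attention weights per word
print(A.shape)        # one output vector per word
```

Row t of `A` is the attention output for word t, which is why working with the stacked matrices is equivalent to handling the q^{<?>} vectors one at a time.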

@Alexander_Valarus, I can imagine that you are trying to stay with Andrew through the two videos, and that you asked these questions because you paid close attention to what he said. I focused my response on the meaning of the symbols (which is why I talked about them at the beginning) because once that is clear, we can move on. If you move on and practice with the labs, you will become more familiar with the material, and when you come back to the videos afterwards, you will find them easier to follow.

It is common to watch the same videos a couple of times (I believe you have done so 😉), but it works better when we get some different experience in between, such as the labs. I therefore also suggest that you go to the lab and check some shapes to verify this for yourself. This is not simple, because you need to read through all the code carefully to identify which part of the code corresponds to which part of the videos.