# Learning q, k, v in self-attention and multihead attention

In the week 4 video “self-attention” at time 4:24, Dr. Ng says “The way that q^{<3>} is computed is as a learned matrix, which I’m going to write as W^Q times x^{<3>}, and similarly for the key and value pairs.”

So q at each time step is computed as a learned weight matrix W^Q times the encoding of the word at that time step, x^{<3>}? Wouldn't the result be a vector rather than a matrix?

In the next video, “Multi-Head Attention” at time 1:00, Dr. Ng says “… calculate multiple self-attentions. So the first of these, you multiply the q, k, v matrices with weight matrices W_1^Q, W_1^K, W_1^V…”

He again referred to each q as a matrix, and this time we’re multiplying it by a weight matrix W. So in self-attention, q^{<i>} = W^Q x^{<i>}, but now we’re computing W_j^Q q^{<i>}. This is all very confusing. Can someone clarify whether q^{<i>} is a matrix or a vector, and whether the weight matrices W in the two videos are unrelated? Are these errors in the videos?

Hello @Alexander_Valarus,

Let’s look at this slide:

The symbols with a ^{<?>} superscript are vectors. The capital Q, K, and V, however, are matrices; for example, Q is the matrix you get by stacking all of the q^{<?>} vectors together.
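A minimal NumPy sketch may make this concrete (the dimensions and random weights here are placeholders I made up, not values from the course): each q^{<i>} = W^Q x^{<i>} comes out as a vector, and stacking all of them gives the matrix Q.

```python
import numpy as np

d_x, d_k, T = 8, 4, 3    # embedding size, query size, sequence length (made-up)
rng = np.random.default_rng(0)

W_Q = rng.normal(size=(d_k, d_x))                # learned weight matrix W^Q
x = [rng.normal(size=(d_x,)) for _ in range(T)]  # encodings x^{<1>}..x^{<3>}

q_3 = W_Q @ x[2]     # q^{<3>} = W^Q x^{<3>} is a vector
print(q_3.shape)     # (4,)

Q = np.stack([W_Q @ x_i for x_i in x])  # stacking the q^{<i>} vectors gives the matrix Q
print(Q.shape)       # (3, 4)
```

K and V arise the same way, by stacking the k^{<i>} and v^{<i>} vectors.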

The W’s in the video “Self-Attention” do not have a numeric subscript because that video covers only one head. The W’s in the video “Multi-Head Attention” have subscripts because that video covers multiple heads: each head has its own set of W’s, and the subscript distinguishes one head’s weights from another’s.
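Here is a rough sketch of the multi-head case (again with made-up dimensions and random placeholder weights), following the video’s formulation in which each head’s W_j^Q, W_j^K, W_j^V multiply the already-computed q, k, v:

```python
import numpy as np

d_x, d, d_k, T, n_heads = 8, 6, 4, 3, 2   # made-up sizes
rng = np.random.default_rng(1)

X = rng.normal(size=(T, d_x))             # rows are x^{<1>}..x^{<T>}

# Single-head quantities, as in the "Self-Attention" video
W_Q = rng.normal(size=(d, d_x))
W_K = rng.normal(size=(d, d_x))
W_V = rng.normal(size=(d, d_x))
Q, K, V = X @ W_Q.T, X @ W_K.T, X @ W_V.T  # each (T, d)

# One set of weights per head: W_j^Q, W_j^K, W_j^V
heads = []
for j in range(n_heads):
    W_Qj = rng.normal(size=(d_k, d))
    W_Kj = rng.normal(size=(d_k, d))
    W_Vj = rng.normal(size=(d_k, d))
    Q_j, K_j, V_j = Q @ W_Qj.T, K @ W_Kj.T, V @ W_Vj.T  # each (T, d_k)

    # scaled dot-product attention for head j
    scores = np.exp(Q_j @ K_j.T / np.sqrt(d_k))
    scores /= scores.sum(axis=1, keepdims=True)  # softmax over keys
    heads.append(scores @ V_j)                   # (T, d_k)

multi_head = np.concatenate(heads, axis=1)       # (T, n_heads * d_k)
print(multi_head.shape)                          # (3, 8)
```

Each head produces its own attention output, and the outputs are concatenated, which is why a fresh set of subscripted W’s is needed per head.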