[Week 4] - Lab - Self Attention


I’m not quite able to understand how the dimensions of q, k, and v are decided. From my understanding, q and k must have the same dimensions as x, but this doesn’t seem to be the case. Any insights would be appreciated.


I assume that by X you mean the input sequence after embedding, i.e., each X<i> is an embedding vector. (Or X is the output sequence of the previous layer, if the attention layer is in the middle of the network.)

The dimensions of Q are the same as X, but K and V depend on what the layer is paying attention to. In self-attention (e.g., in the encoder), Q, K, V, and X all have the same dimensions, because the sequence attends to itself. However, if your network attends to another network’s output, as in part of the decoder, then the dimensions of K and V match that other network’s output.
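To make the shapes concrete, here is a small NumPy sketch of scaled dot-product attention (my own toy example, not code from the lab): the query sequence has 4 positions and the attended sequence has 6, so Q has 4 rows while K and V have 6, yet everything still multiplies out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions for illustration): query sequence length 4,
# attended sequence length 6, feature dimension 8.
T_q, T_k, d = 4, 6, 8

Q = rng.standard_normal((T_q, d))  # one row per query position
K = rng.standard_normal((T_k, d))  # one row per attended position
V = rng.standard_normal((T_k, d))  # V has the same number of rows as K

scores = Q @ K.T / np.sqrt(d)                 # (T_q, T_k) similarity scores
alpha = np.exp(scores)
alpha = alpha / alpha.sum(axis=1, keepdims=True)  # softmax over the keys
A = alpha @ V                                 # (T_q, d): one output per query

print(Q.shape, K.shape, A.shape)  # (4, 8) (6, 8) (4, 8)
```

For self-attention, set T_q == T_k (Q, K, V all come from the same X); for encoder–decoder attention, K and V come from the encoder output, so T_k is the encoder’s sequence length.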

Just like Andrew mentioned in the lecture, there is an analogy between RNN attention and transformer attention:
From the picture, q is similar to t, and k is similar to t'; they do not necessarily have the same dimensions.
BTW, another way to think of the picture is that alpha is a probability (weight) distribution, and A is just the weighted sum of v.
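That last point can be checked with a couple of lines of NumPy (toy numbers of my own, just for illustration): alpha sums to 1, and A is literally the alpha-weighted sum of the value vectors.

```python
import numpy as np

# alpha is a probability distribution over three attended positions
# (the kind of vector a softmax would produce).
alpha = np.array([0.1, 0.7, 0.2])

# Three value vectors, one per attended position.
v = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

A = alpha @ v  # weighted sum: 0.1*v[0] + 0.7*v[1] + 0.2*v[2]
print(A)       # [0.3 0.9]
```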
