[Week 4] - Lab - Self Attention


I’m not quite able to understand how the dimensions of q, k, and v are decided. From my understanding, q and k must have the same dimensions as x, but this doesn’t seem to be the case. Any insights would be appreciated.


I assume that by X you mean the input sequence after embedding, i.e., each X<i> is an embedding vector. (Or X is the output sequence of the previous layer, if the attention layer is in the middle of the network.)

The dimensions of Q are the same as X, but K and V depend on what the layer is paying attention to. In self-attention (e.g., in the encoder), Q, K, V, and X all have the same dimensions, because the sequence attends to itself. However, if your network attends to another network’s output, as in part of the decoder, then the dimensions of K and V match that other network’s output.
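To make the shapes concrete, here is a small NumPy sketch of scaled dot-product attention (my own toy example, not code from the lab): the query sequence has 4 positions and the attended sequence has 6, so Q has 4 rows while K and V have 6, yet everything still multiplies out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions for illustration): query sequence length 4,
# attended sequence length 6, feature dimension 8.
T_q, T_k, d = 4, 6, 8

Q = rng.standard_normal((T_q, d))  # one row per query position
K = rng.standard_normal((T_k, d))  # one row per attended position
V = rng.standard_normal((T_k, d))  # V has the same number of rows as K

scores = Q @ K.T / np.sqrt(d)                 # (T_q, T_k) similarity scores
alpha = np.exp(scores)
alpha = alpha / alpha.sum(axis=1, keepdims=True)  # softmax over the keys
A = alpha @ V                                 # (T_q, d): one output per query

print(Q.shape, K.shape, A.shape)  # (4, 8) (6, 8) (4, 8)
```

For self-attention, set T_q == T_k (Q, K, V all come from the same X); for encoder–decoder attention, K and V come from the encoder output, so T_k is the encoder’s sequence length.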

Just like Andrew mentioned in the lecture, there is an analogy between RNN attention and transformer attention:
From the picture, q is similar to t, and k is similar to t'; they do not necessarily have the same dimensions.
BTW, another way to think of the picture is that alpha is a probability (weight) distribution, and A is just the weighted sum of v.
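That last point can be checked with a couple of lines of NumPy (toy numbers of my own, just for illustration): alpha sums to 1, and A is literally the alpha-weighted sum of the value vectors.

```python
import numpy as np

# alpha is a probability distribution over three attended positions
# (the kind of vector a softmax would produce).
alpha = np.array([0.1, 0.7, 0.2])

# Three value vectors, one per attended position.
v = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

A = alpha @ v  # weighted sum: 0.1*v[0] + 0.7*v[1] + 0.2*v[2]
print(A)       # [0.3 0.9]
```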
