Hi,

I’m not quite able to understand how the dimensions of q, k and v are decided. From my understanding q and k must have the same dimensions as x however this doesn’t seem to be the case. Any insights would be appreciated.

Cheers!

I assume that by X you mean the input sequence after embedding, in other words, each `X<i>` is an embedding vector (or the output sequence of the previous layer, if the attention layer sits in the middle of the network).

The dimension of `Q` is the same as X, but `K` and `V` depend on what the layer attends to. For self-attention, `Q`, `K`, `V` and `X` all have the same dimensions, because the layer attends to itself, e.g., in the encoder. However, if your network attends to another network, as part of the decoder does, the dimensions of `K` and `V` match the output of that other network.
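
In case it helps, here is a minimal NumPy sketch of how the shapes work out. The sizes and names (`d_model`, `len_q`, `len_kv`, `X_dec`, `X_enc`) are made up for illustration, not taken from the course code:

```python
import numpy as np

d_model = 512   # embedding size of each x_i (assumed)
len_q = 10      # length of the sequence doing the querying (assumed)
len_kv = 15     # length of the sequence being attended to (assumed)

# Cross-attention (decoder attending to the encoder): Q is projected from the
# decoder-side X, while K and V are projected from the encoder output, so the
# number of rows of K/V follows the encoder sequence. For self-attention,
# X_dec and X_enc would be the same tensor and len_q == len_kv.
X_dec = np.random.randn(len_q, d_model)
X_enc = np.random.randn(len_kv, d_model)

W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q = X_dec @ W_q   # (len_q, d_model)  -> same number of rows as the querying X
K = X_enc @ W_k   # (len_kv, d_model) -> rows follow whatever is attended to
V = X_enc @ W_v   # (len_kv, d_model)

print(Q.shape, K.shape, V.shape)   # (10, 512) (15, 512) (15, 512)
```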

Just like Andrew mentioned in the lecture, there are analogies between RNN attention and transformer attention:

From the picture, `q` is similar to `t`, and `k` is similar to `t'`; they don't necessarily have the same dimensions.

BTW, another way to think of the picture is that *alpha* is a probability (weights) distribution, and *A* is just the weighted sum of `v`.
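
A tiny self-contained sketch of that last point (again NumPy, with arbitrary sizes): *alpha* comes out of a softmax, so each row is a probability distribution, and *A* is that distribution applied as weights to the rows of `v`:

```python
import numpy as np

len_q, len_kv, d = 3, 5, 4          # arbitrary small sizes for illustration
Q = np.random.randn(len_q, d)
K = np.random.randn(len_kv, d)
V = np.random.randn(len_kv, d)

scores = Q @ K.T / np.sqrt(d)                                        # (len_q, len_kv)
alpha = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax over keys
A = alpha @ V                                                        # weighted sum of the rows of V

print(np.allclose(alpha.sum(axis=-1), 1.0))  # True: each row of alpha sums to 1
print(A.shape)                               # (3, 4)
```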
