C5-W4-A1 Understanding dimensions in the scaled-dot-product-attention

I’m very confused about the dimensions of the inputs to the scaled-dot-product-attention function, W4 assignment 1 in DLS course 5.

We are told that the matrices Q, K and V have dimensions seq_len_q × depth, seq_len_k × depth and seq_len_v × depth_v respectively.

Mathematically, seq_len_k must equal seq_len_v, otherwise the final multiplication tf.matmul(attention_weights, v) will fail. In other words, every key vector must have a corresponding value vector.

Mathematically, the number of query vectors fed into scaled-dot-product-attention (the seq_len_q vectors, each of length depth) can be different from the number of key vectors (the seq_len_k vectors, each of length depth). That is, we are allowed to have seq_len_q ≠ seq_len_k.
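
To make the shape bookkeeping concrete, here is a minimal sketch of scaled dot-product attention in plain TensorFlow. This is my own reconstruction following the standard pattern the assignment is based on, so names and details may not match the notebook exactly:

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    # q: (..., seq_len_q, depth), k: (..., seq_len_k, depth), v: (..., seq_len_v, depth_v)
    matmul_qk = tf.matmul(q, k, transpose_b=True)              # (..., seq_len_q, seq_len_k)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_logits = matmul_qk / tf.math.sqrt(dk)
    if mask is not None:
        scaled_logits += (mask * -1e9)
    attention_weights = tf.nn.softmax(scaled_logits, axis=-1)  # (..., seq_len_q, seq_len_k)
    output = tf.matmul(attention_weights, v)                   # needs seq_len_k == seq_len_v
    return output, attention_weights

# seq_len_q = 3 differs from seq_len_k = seq_len_v = 5, and this still runs:
q = tf.random.normal((3, 4))
k = tf.random.normal((5, 4))
v = tf.random.normal((5, 6))
out, w = scaled_dot_product_attention(q, k, v)
print(out.shape, w.shape)  # (3, 6) (3, 5): one output row per query
```

If you instead gave v a sequence length of 6, the final tf.matmul(attention_weights, v) would fail, which is exactly the seq_len_k = seq_len_v constraint described above.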

But in the picture used to explain self-attention (see below) and in the lectures we always imagine assigning a query vector ‘q’ to each word, along with key and value vectors. In other words, surely in applications we always have seq_len_q = seq_len_k?

If we don’t have seq_len_q = seq_len_k, what is the intuition? Are we somehow not assigning query vectors to some input words x^{<i>}? I’m very stuck on the interpretation of all of this!

PS: any answers shouldn’t (I don’t think) refer to multi-head attention, because the question above concerns just one set of Q, K and V and whether or not we have query vectors associated with each word. Multi-head attention introduces different Q, K and V (i.e. new parameter matrices that produce different embeddings of the sentence) calculated in parallel, which has no bearing on the question here.

Hi Alastair_Heffernan,

I agree. It would have been good had there been some assert statements in the function to ensure that seq_len_q, seq_len_k, and seq_len_v are equal.
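
For what it’s worth, here is a sketch of what such checks could look like; this is a hypothetical helper of mine, not part of the assignment. Strictly, only seq_len_k = seq_len_v is required for the matrix products to be defined, while the seq_len_q check only applies in the self-attention setting:

```python
import tensorflow as tf

def check_attention_shapes(q, k, v):
    # Required for q @ k^T: q and k must share the same depth (last axis).
    tf.debugging.assert_equal(
        tf.shape(q)[-1], tf.shape(k)[-1],
        message="depth of q must equal depth of k")
    # Required for attention_weights @ v: every key needs a matching value.
    tf.debugging.assert_equal(
        tf.shape(k)[-2], tf.shape(v)[-2],
        message="seq_len_k must equal seq_len_v")
    # Optional, specific to self-attention: queries come from the same sequence,
    # so seq_len_q would equal seq_len_k as well.
    # tf.debugging.assert_equal(tf.shape(q)[-2], tf.shape(k)[-2])
```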

Hi Alastair,

Many thanks for your question! After watching the lecture videos and reading the paper, I was confused in a very similar way to you.

The thing is that, in contrast to all the other lectures in the specialization, the lectures on Self-Attention unfortunately skip any presentation of the matrix structure and dimensions entirely.

So, the actual source of confusion is that:

  1. the input data (word embeddings) are packed into the input matrix X row-wise.
  2. the matrices Q, K, V are computed as the matrix multiplication XW (and not WX, as said in the lecture), where W is the respective weight matrix (W^Q, W^K, or W^V); see the sketch after this list.
  3. the rest becomes clear once you keep the two points above in mind.
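
Here is a tiny numerical sketch of points 1 and 2; the shapes and names (X, W_q, …) are mine for illustration, not the course’s:

```python
import tensorflow as tf

seq_len, d_model, depth = 4, 6, 3            # 4 words, embedding size 6, attention depth 3

X   = tf.random.normal((seq_len, d_model))   # one word embedding per ROW (point 1)
W_q = tf.random.normal((d_model, depth))
W_k = tf.random.normal((d_model, depth))
W_v = tf.random.normal((d_model, depth))

Q = tf.matmul(X, W_q)   # XW, not WX: shape (seq_len, depth), one query vector per row/word
K = tf.matmul(X, W_k)   # (seq_len, depth)
V = tf.matmul(X, W_v)   # (seq_len, depth)
```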

There is a very detailed, illustrated and mathematically clear and exact explanation here: https://theaisummer.com/self-attention/

Hope this helps you further!