I’m very confused about the dimensions of the inputs to the scaled_dot_product_attention function in W4 Assignment 1 of DLS Course 5.
We are told that the matrices Q, K and V have dimensions \text{seq_len_q} \times \text{depth}, \text{seq_len_k} \times \text{depth} and \text{seq_len_v} \times \text{depth_v}, respectively.
Mathematically, \text{seq_len_k} must equal \text{seq_len_v}; otherwise the final multiplication tf.matmul(attention_weights, v) will fail. In other words, every key vector must have a corresponding value vector.
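To make this concrete, here is a tiny sketch (with toy shapes I’ve made up, not the assignment’s code) of the shape clash when \text{seq_len_k} \neq \text{seq_len_v}:

```python
import tensorflow as tf

# Made-up shapes: 4 queries, 6 keys, but only 5 values (deliberately inconsistent).
seq_len_q, seq_len_k, seq_len_v, depth_v = 4, 6, 5, 10

attention_weights = tf.random.uniform((seq_len_q, seq_len_k))  # stands in for softmax(QK^T / sqrt(depth))
v = tf.random.normal((seq_len_v, depth_v))

try:
    tf.matmul(attention_weights, v)  # contracts seq_len_k against seq_len_v
except (tf.errors.InvalidArgumentError, ValueError) as e:
    print("shape mismatch:", e)      # fails because 6 != 5
```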
Mathematically, the number of query vectors (i.e. the \text{seq_len_q} vectors, each of length \text{depth}) fed into scaled dot-product attention can differ from the number of key vectors (the \text{seq_len_k} vectors, each of length \text{depth}); that is, we are allowed to have \text{seq_len_q} \neq \text{seq_len_k}.
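And here is the same kind of sketch with consistent keys/values but a different number of queries, just to show that the maths goes through and you simply get one output row per query (again, toy shapes of my own, not the assignment’s):

```python
import tensorflow as tf

seq_len_q, seq_len_k, depth, depth_v = 3, 7, 8, 10   # seq_len_v must equal seq_len_k, so also 7

q = tf.random.normal((seq_len_q, depth))
k = tf.random.normal((seq_len_k, depth))
v = tf.random.normal((seq_len_k, depth_v))

scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(tf.cast(depth, tf.float32))
attention_weights = tf.nn.softmax(scores, axis=-1)   # (3, 7): each query attends over all 7 keys
output = tf.matmul(attention_weights, v)

print(attention_weights.shape)   # (3, 7)
print(output.shape)              # (3, 10), i.e. (seq_len_q, depth_v)
```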
But in the picture used to explain self-attention (see below) and in the lectures we always imagine assigning a query vector ‘q’ to each word, along with key and value vectors. In other words, surely in applications we always have \text{seq_len_q} = \text{seq_len_k}?
If we don’t have \text{seq_len_q} = \text{seq_len_k}, what is the intuition? Are we somehow not assigning query vectors to some input words x^{<i>}? I’m very stuck on the interpretation of all of this!
PS: I don’t think any answer should need to refer to multi-head attention, because the question above concerns just one set of Q, K and V and whether or not we have query vectors associated with each word. Multi-head attention introduces different Q, K and V (i.e. new parameter matrices that produce different embeddings of the sentence) calculated in parallel, which has no bearing on the question here.