I’m very confused about the dimensions of the inputs to the scaled dot-product attention function in W4 Assignment 1 of DLS Course 5.

We are told that the matrices Q, K and V have dimensions \text{seq_len_q} \times \text{depth}, \text{seq_len_k} \times \text{depth} and \text{seq_len_v} \times \text{depth_v} respectively.

Mathematically, \text{seq_len_k} must equal \text{seq_len_v}, otherwise the final multiplication `tf.matmul(attention_weights, v)` will fail: `attention_weights` has shape \text{seq_len_q} \times \text{seq_len_k}, so its inner dimension must match the number of rows of `v`. In other words, every key vector must have a corresponding value vector.
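To make the constraint concrete, here is a small NumPy sketch (the shapes are my own illustrative choices, not from the assignment) showing that the final matmul only works when \text{seq_len_v} matches \text{seq_len_k}:

```python
import numpy as np

# Illustrative shapes (my own, not from the assignment):
# attention_weights: (seq_len_q, seq_len_k); v: (seq_len_v, depth_v)
attention_weights = np.ones((4, 6)) / 6   # seq_len_q=4, seq_len_k=6
v_ok = np.random.randn(6, 5)              # seq_len_v=6 matches seq_len_k
v_bad = np.random.randn(7, 5)             # seq_len_v=7 does not

out = attention_weights @ v_ok            # fine: result is (4, 5)
try:
    attention_weights @ v_bad             # inner dimensions 6 vs 7 disagree
except ValueError as e:
    print("matmul fails:", e)
```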

Mathematically, the number of query vectors (the \text{seq_len_q} vectors, each of length \text{depth}) fed into scaled dot-product attention can be *different from* the number of key vectors (the \text{seq_len_k} vectors, each of length \text{depth}), i.e. we are allowed to have \text{seq_len_q} \neq \text{seq_len_k}.
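A quick sanity check of this claim, again in plain NumPy rather than TensorFlow, and with shapes I picked myself: the computation goes through with \text{seq_len_q} \neq \text{seq_len_k}, producing attention weights of shape (\text{seq_len_q}, \text{seq_len_k}) and an output of shape (\text{seq_len_q}, \text{depth_v}):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Plain-NumPy sketch of scaled dot-product attention (no mask, no TF)."""
    depth = q.shape[-1]
    scores = q @ k.T / np.sqrt(depth)                  # (seq_len_q, seq_len_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ v, weights                        # (seq_len_q, depth_v)

seq_len_q, seq_len_k, depth, depth_v = 3, 10, 8, 4     # deliberately unequal
q = np.random.randn(seq_len_q, depth)
k = np.random.randn(seq_len_k, depth)
v = np.random.randn(seq_len_k, depth_v)                # seq_len_v == seq_len_k

out, w = scaled_dot_product_attention(q, k, v)
print(out.shape, w.shape)   # (3, 4) (3, 10)
```

Each of the 3 query vectors gets its own distribution over the 10 keys, so nothing forces the two sequence lengths to agree.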

But in the picture used to explain self-attention (see below) and in the lectures, we always imagine assigning a query vector ‘q’ to each word, along with key and value vectors. In other words, surely in applications we always have \text{seq_len_q} = \text{seq_len_k}?

If we don’t have \text{seq_len_q} = \text{seq_len_k}, what is the intuition? Are we somehow **not assigning** query vectors to some input words x^{<i>}? I’m very stuck with the interpretation of all of this!

PS: any answers shouldn’t (I don’t think) need to refer to multi-head attention, because the question above concerns just one set of Q, K and V and whether or not we have query vectors associated with each word. Multi-head attention introduces several sets of Q, K and V (i.e. new parameter matrices that produce different embeddings of the sentence) calculated in parallel, which has no bearing on the question here.