I’m very confused about the dimensions of the inputs to the scaled_dot_product_attention function in W4 Assignment 1 of DLS Course 5.
We are told that the matrices Q, K and V have dimensions \text{seq_len_q} \times \text{depth}, \text{seq_len_k} \times \text{depth} and \text{seq_len_v} \times \text{depth_v}, respectively.
Mathematically, \text{seq_len_k} must equal \text{seq_len_v}; otherwise the final multiplication tf.matmul(attention_weights, v) will fail. In other words, every key vector must have a corresponding value vector.
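To make this concrete, here is a tiny sketch (with toy shapes I’ve made up, not the assignment’s code) of the shape clash when \text{seq_len_k} \neq \text{seq_len_v}:

```python
import tensorflow as tf

# Made-up shapes: 4 queries, 6 keys, but only 5 values (deliberately inconsistent).
seq_len_q, seq_len_k, seq_len_v, depth_v = 4, 6, 5, 10

attention_weights = tf.random.uniform((seq_len_q, seq_len_k))  # stands in for softmax(QK^T / sqrt(depth))
v = tf.random.normal((seq_len_v, depth_v))

try:
    tf.matmul(attention_weights, v)  # contracts seq_len_k against seq_len_v
except (tf.errors.InvalidArgumentError, ValueError) as e:
    print("shape mismatch:", e)      # fails because 6 != 5
```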
Mathematically, the number of query vectors (i.e. the \text{seq_len_q} vectors, each of length \text{depth}) fed into scaled dot-product attention can differ from the number of key vectors (the \text{seq_len_k} vectors, each of length \text{depth}); that is, we are allowed to have \text{seq_len_q} \neq \text{seq_len_k}.
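And here is the same kind of sketch with consistent keys/values but a different number of queries, just to show that the maths goes through and you simply get one output row per query (again, toy shapes of my own, not the assignment’s):

```python
import tensorflow as tf

seq_len_q, seq_len_k, depth, depth_v = 3, 7, 8, 10   # seq_len_v must equal seq_len_k, so also 7

q = tf.random.normal((seq_len_q, depth))
k = tf.random.normal((seq_len_k, depth))
v = tf.random.normal((seq_len_k, depth_v))

scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(tf.cast(depth, tf.float32))
attention_weights = tf.nn.softmax(scores, axis=-1)   # (3, 7): each query attends over all 7 keys
output = tf.matmul(attention_weights, v)

print(attention_weights.shape)   # (3, 7)
print(output.shape)              # (3, 10), i.e. (seq_len_q, depth_v)
```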
But in the picture used to explain self-attention (see below) and in the lectures we always imagine assigning a query vector ‘q’ to each word, along with key and value vectors. In other words, surely in applications we always have \text{seq_len_q} = \text{seq_len_k}?
If we don’t have \text{seq_len_q} = \text{seq_len_k}, what is the intuition? Are we somehow not assigning query vectors to some input words x^{<i>}? I’m very stuck on the interpretation of all of this!
PS: I don’t think any answer should need to refer to multi-head attention, because the question above concerns just one set of Q, K and V and whether or not we have query vectors associated with each word. Multi-head attention introduces different Q, K and V (i.e. new parameter matrices that produce different embeddings of the sentence) calculated in parallel, which has no bearing on the question here.