Relevance of the shapes of the Q, K and V tensors


I have some doubts about the function scaled_dot_product_attention in programming assignment 1 of Week 4.

Does this function calculate the attention output for one head?

Does the Q matrix hold all the queries for one head of the multi-head attention (one self-attention mechanism)?

Is any dimension of the K or V tensor related to the number of words in the sentence/sequence? I ask because, for each query, the number of keys must equal the number of words in the sequence.

Hi @Faraz !

I think I can help with these questions. Let's go one by one:

Yes, this function runs inside each head and calculates the self-attention for that head. If you have, say, 8 heads (like the original paper does), then you will be running this function in each head with its own Q, K, and V matrices.

Each head holds its own Q, K, and V matrices. If you have 8 heads, then you’ll have 8 sets of Q, K, V.

The depth (last dimension) of each head’s Q, K, and V is d_model // nheads.

d_model is the dimension of the embeddings of the model. In the original paper, this was 512.
nheads is the number of heads in the multi-head attention module.

If d_model = 512 and nheads = 4, then the depth of each head’s Q, K, and V = 512 // 4 = 128.

And as additional information: after the attention of each head is calculated, all the outputs are concatenated and you get a single embedding matrix again, with dim = 512 in our previous example.
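The split-and-concatenate arithmetic above can be sketched as follows. This is a minimal NumPy illustration (the assignment itself uses TensorFlow; the variable names and the random input `x` are just for demonstration):

```python
import numpy as np

d_model, nheads = 512, 4
depth = d_model // nheads          # 128: per-head dimension
seq_len = 10                       # number of tokens, for illustration

# One embedding per token for the whole layer:
x = np.random.randn(seq_len, d_model)

# Split into heads: (seq_len, d_model) -> (nheads, seq_len, depth)
heads = x.reshape(seq_len, nheads, depth).transpose(1, 0, 2)
print(heads.shape)                 # (4, 10, 128)

# After attention, the heads' outputs are concatenated back to d_model:
concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
print(concat.shape)                # (10, 512)
```

Note that each head only sees a 128-dimensional slice, but the concatenation restores the full 512-dimensional embedding.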

I hope this solves your question. Otherwise, please ask any follow up questions and I’ll be happy to answer!

Happy learning!


Thanks, Juan, for a very clear explanation.

I have one more doubt.
In the lecture, the Q, K, V tensors were described such that each word has one “query, key, value” triple associated with it.

I was trying to find that correspondence in the Q, K, V tensors in the programming assignment.

Does the shape of the Q, K, V tensors have nothing to do with the sequence length (the number of tokens the sentence is broken into)?

Hi @Faraz , good question!

Let’s work this out from the initial definition of the input.

When you are designing a transformer, one of the first decisions you make is the ‘context’ size: the maximum number of tokens that the model can handle at the same time. The original paper proposed 512 tokens; GPT-4 has 8,192 tokens. This is the maximum number of tokens that you can input into the model.

Now to your question:
Each token is converted to an embedding, a positional encoding is added to it, and then it goes into the attention module, where it is used to generate a (q, k, v) triple. Each embedding has its own q, k, v.
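A minimal sketch of “each embedding has its own q, k, v”, in NumPy (the projection matrices Wq/Wk/Wv are random stand-ins here; in a real model they are learned weights):

```python
import numpy as np

seq_len, d_model, depth = 5, 8, 8          # tiny sizes for illustration
x = np.random.randn(seq_len, d_model)      # one embedding per token

# Projection matrices (random here, learned during training in practice)
Wq = np.random.randn(d_model, depth)
Wk = np.random.randn(d_model, depth)
Wv = np.random.randn(d_model, depth)

q, k, v = x @ Wq, x @ Wk, x @ Wv
# Row i of q, k, and v belongs to token i: one (q, k, v) triple per token
print(q.shape, k.shape, v.shape)           # (5, 8) each
```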

If the context is 512 but the input is a phrase that turns out to be 100 tokens, the other 412 positions will be ‘padded’, for example with zeros.
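The padding step can be sketched like this (the token ids here are dummies; a real tokenizer would produce them):

```python
import numpy as np

context_size = 512
token_ids = np.arange(1, 101)               # a 100-token phrase (dummy ids)

# Pad with zeros up to the context size
padded = np.zeros(context_size, dtype=int)
padded[:len(token_ids)] = token_ids

print(padded.shape)         # (512,)
print((padded == 0).sum())  # 412 padded positions
```

This is also why attention uses a padding mask: so those 412 zero positions don’t contribute to the attention weights.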


So in the programming assignment, in the scaled_dot_product_attention function:
seq_len_q of tensor q (the length of axis -2 of q) is the ‘context size’.
Is this correct?

Could you paste here that portion of the assignment, without any of your code? Just the code as provided originally?

Please find the code snippet as follows:


    # GRADED FUNCTION: scaled_dot_product_attention

    def scaled_dot_product_attention(q, k, v, mask):
        """
        Calculate the attention weights.
        q, k, v must have matching leading dimensions.
        k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
        The mask has different shapes depending on its type (padding or look ahead),
        but it must be broadcastable for addition.

        Arguments:
            q -- query shape == (..., seq_len_q, depth)
            k -- key shape == (..., seq_len_k, depth)
            v -- value shape == (..., seq_len_v, depth_v)
            mask -- Float tensor with shape broadcastable
                    to (..., seq_len_q, seq_len_k). Defaults to None.

        Returns:
            output -- attention_weights
        """

        matmul_qk = None  # (..., seq_len_q, seq_len_k)

        output = None  # (..., seq_len_q, depth_v)
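To see how those shape comments play out, here is a NumPy sketch (not the graded TensorFlow solution) that runs the same computation on random tensors, with the example sizes from earlier in the thread (nheads = 4, depth = 512 // 4 = 128, and an arbitrary sequence length of 10):

```python
import numpy as np

batch, nheads, seq_len, depth = 2, 4, 10, 128   # example sizes

q = np.random.randn(batch, nheads, seq_len, depth)
k = np.random.randn(batch, nheads, seq_len, depth)
v = np.random.randn(batch, nheads, seq_len, depth)

# matmul_qk: one score per (query token, key token) pair
matmul_qk = q @ k.transpose(0, 1, 3, 2)          # (..., seq_len_q, seq_len_k)
scaled = matmul_qk / np.sqrt(depth)

# Numerically stable softmax over the last axis
e = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)      # (..., seq_len_q, seq_len_k)

output = weights @ v                             # (..., seq_len_q, depth_v)

print(matmul_qk.shape)   # (2, 4, 10, 10): seq_len by seq_len, not depth
print(output.shape)      # (2, 4, 10, 128)
```

The attention-weights matrix is (seq_len_q, seq_len_k), i.e. token-by-token, while depth (128 here) only appears in the last axis of q, k, v, and the output. That is the distinction between sequence length and per-head depth being discussed below.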

It seems to me that ‘seq_len_q’ (and its _k and _v counterparts) is already the per-head size, meaning:

seq_len_q = d_model // nheads

Check out the call to this function. Maybe it is calculated there.

From where the function is called, it is not clear. However, I printed the q, k, v shapes, which are in the snapshot.

Thanks @Faraz. Can you share with me the exact lesson you are looking at, or even better, share the notebook internally? I’d like to check it out.

Hi @Faraz, regarding the error you are getting in the notebook, please check that you are passing all the parameters required to the multi-head attention. You may be passing only Query, but remember, we need Query, Key, and Value.