I have some doubts about the function scaled_dot_product_attention in programming assignment 1 of Week 4.
Does this function calculate the attention output for one head?
Does the Q matrix hold all the queries for one head of the multi-head attention (one self-attention mechanism)?
Is any dimension of the K or V tensor related to the number of words in the sentence/sequence? For each query, the number of keys must equal the number of words in the sequence.
I think I can help with these questions. Let's go one by one:
Yes, this function runs inside each head and calculates the self-attention for that head. If you have, say, 8 heads (as the original paper does), then you will run this function in each head with its own Q, K, V matrices.
Each head holds its own Q, K, V matrices. If you have 8 heads, then you'll have 8 sets of Q, K, V.
The depth (last dimension) of each head's Q, K, V is equal to d_model // nheads.
d_model is the dimension of the embeddings of the model. In the original paper, this was 512.
nheads is the number of heads in the multi-head attention module.
If d_model = 512 and nheads = 4, then the depth of each head's Q, K, and V is 512 // 4 = 128.
And as additional information: after the attention output of each head is calculated, all the outputs are concatenated and you again get a single embedding matrix with dim = 512 in our previous example.
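For example, here is a quick shape check of that split-and-concatenate step (a rough sketch in TensorFlow; the batch and sequence sizes are just toy values, not from the assignment):

```python
import tensorflow as tf

batch_size, seq_len, d_model, nheads = 2, 10, 512, 4
depth = d_model // nheads  # 128 per head

x = tf.random.uniform((batch_size, seq_len, d_model))

# Split the last dimension into (nheads, depth) and move the head axis forward.
heads = tf.reshape(x, (batch_size, seq_len, nheads, depth))
heads = tf.transpose(heads, perm=[0, 2, 1, 3])
print(heads.shape)   # (2, 4, 10, 128) -> each head works with depth 128

# After attention, the heads are merged back into a single embedding matrix.
merged = tf.transpose(heads, perm=[0, 2, 1, 3])
merged = tf.reshape(merged, (batch_size, seq_len, d_model))
print(merged.shape)  # (2, 10, 512)
```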
I hope this answers your questions. Otherwise, please ask any follow-up questions and I'll be happy to answer!
Let's work this out from the initial definitions of the input.
When you are designing a transformer, one of the first decisions you make is the 'context' size, that is, the maximum number of tokens the model can handle at the same time. The original paper used 512 tokens. GPT-4 has a context of 8,192 tokens. This is the maximum number of tokens you can feed into the model.
Now to your question:
Each token is converted to an embedding, a positional encoding is added to it, and then it goes into the attention module, where it is used to generate a q, k, and v. Each embedding has its own q, k, v.
If the context is 512 but the input phrase turns out to be 100 tokens, the other 412 positions are 'padded', for example with zeros, and a padding mask is used so attention ignores them.
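A small sketch of how such a padding mask is usually built (assuming TensorFlow and the common convention that token id 0 means padding; the context size here is a toy value):

```python
import tensorflow as tf

context_size = 8  # toy value instead of 512
token_ids = tf.constant([[5, 7, 2, 0, 0, 0, 0, 0]])  # 3 real tokens, 5 padding zeros

# 1.0 where the token is padding (id == 0), 0.0 where it is a real token.
padding_mask = tf.cast(tf.math.equal(token_ids, 0), tf.float32)
print(padding_mask)  # [[0. 0. 0. 1. 1. 1. 1. 1.]]

# Broadcast to (..., seq_len_q, seq_len_k) so it can be added to the attention
# scores inside scaled_dot_product_attention (scaled by a large negative number).
mask = padding_mask[:, tf.newaxis, tf.newaxis, :]
print(mask.shape)  # (1, 1, 1, 8)
```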
So, in the programming assignment, in the scaled_dot_product_attention function, is seq_len_q of tensor q (the length of axis -2 of q) the 'context size'?
Is this correct?
```python
def scaled_dot_product_attention(q, k, v, mask):
    """
    Calculate the attention weights.
    q, k, v must have matching leading dimensions.
    k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
    The mask has different shapes depending on its type (padding or look ahead)
    but it must be broadcastable for addition.

    Arguments:
        q -- query shape == (..., seq_len_q, depth)
        k -- key shape == (..., seq_len_k, depth)
        v -- value shape == (..., seq_len_v, depth_v)
        mask: Float tensor with shape broadcastable
              to (..., seq_len_q, seq_len_k). Defaults to None.

    Returns:
        output -- attention_weights
    """
    # START CODE HERE
    matmul_qk = None  # (..., seq_len_q, seq_len_k)
```
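For reference, here is a minimal standalone sketch of the standard scaled dot-product attention formula from the paper, written with TensorFlow. It is an illustration of the math, not necessarily the exact code the assignment expects:

```python
import tensorflow as tf

def scaled_dot_product_attention_sketch(q, k, v, mask=None):
    # Raw attention scores between every query and every key: (..., seq_len_q, seq_len_k)
    matmul_qk = tf.matmul(q, k, transpose_b=True)

    # Scale by sqrt(depth) to keep the logits in a well-behaved range for softmax.
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # Masked positions get a large negative value so softmax drives them to ~0.
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)

    # Softmax over the key axis: each query's weights sum to 1.
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)

    # Weighted sum of the values: (..., seq_len_q, depth_v)
    output = tf.matmul(attention_weights, v)
    return output, attention_weights
```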
Hi @Faraz, regarding the error you are getting in the notebook, please check whether you are passing all the parameters required by the multi-head attention. You may be passing only the Query, but remember, we need the Query, Key, and Value.
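For example, with the built-in Keras layer (this uses tf.keras.layers.MultiHeadAttention rather than the assignment's own class, just to show that all three inputs are needed):

```python
import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)

q = tf.random.uniform((1, 10, 512))  # (batch, seq_len_q, d_model)
k = tf.random.uniform((1, 12, 512))  # (batch, seq_len_k, d_model)
v = tf.random.uniform((1, 12, 512))  # (batch, seq_len_v, d_model)

# Passing only the query would fail; query, value, and key are all needed
# (if key is omitted, Keras falls back to using value as the key).
output = mha(query=q, value=v, key=k)
print(output.shape)  # (1, 10, 512)
```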