I have some doubts about the function scaled_dot_product_attention in programming assignment 1 of Week 4.

Does this function calculate the attention output for one head?

Does the Q matrix hold all the queries for one head of the multi-head attention (one self-attention mechanism)?

Is any dimension of the K or V tensor related to the number of words in the sentence/sequence? For each query, the number of keys must equal the number of words in the sequence.

I think I can help with these questions. Let's go one by one:

Yes, this function is applied inside each head and calculates the self-attention for that head. If you have, say, 8 heads (like the original paper does), then you will be running this function in each head with its own Q, K, and V matrices.

Each head holds its own Q, K, and V matrices. If you have 8 heads, then you'll have 8 sets of Q, K, and V.

The per-head dimension (depth) of Q, K, and V is equal to d_model // nheads.

d_model is the dimension of the embeddings of the model. In the original paper, this was 512.
nheads is the number of heads in the multi-head attention module.

If d_model = 512 and nheads = 4, then the depth of each Q, K, and V is 512 // 4 = 128.

As additional information: after the attention of each head is calculated, all the outputs are concatenated, and you get a single embedding matrix again, with dim = 512 in our previous example.
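To make the split-and-concatenate step above concrete, here is a minimal NumPy sketch. The sequence length of 10 and the random input are hypothetical; the d_model and nheads values match the example above:

```python
import numpy as np

d_model, n_heads = 512, 4
depth = d_model // n_heads           # 128 per head, as in the example above

seq_len = 10                         # hypothetical sequence length
x = np.random.randn(seq_len, d_model)

# Split the embedding into n_heads slices of size `depth` each
heads = np.split(x, n_heads, axis=-1)   # 4 arrays of shape (10, 128)

# ... each head would run scaled dot-product attention on its own slice ...

# Concatenating the head outputs restores the full embedding width
out = np.concatenate(heads, axis=-1)    # shape (10, 512)
```

(In the real module, each head's slice is first projected by learned weight matrices; the sketch only shows how the dimensions split and recombine.)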

I hope this answers your question. Otherwise, please ask any follow-up questions and I'll be happy to answer!

Let's work this out from the initial definitions of the input.

When you are designing a transformer, one of the first decisions you make is the 'context' size, that is, the maximum number of tokens the model can handle at the same time. The original paper used 512 tokens; the base GPT-4 model has 8,192 tokens. This is the maximum number of tokens that you can input into the model.

Now to your question:
Each token is converted to an embedding, a positional encoding is added to it, and then it goes into the attention module, where it is used to generate q, k, and v vectors. Each embedding has its own q, k, and v.

If the context is 512 but the input is a phrase that turns out to be 100 tokens, the other 412 positions will be padded, perhaps with zeros.
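The padding idea above can be sketched as follows. The token ids are fake placeholders, and the mask convention (1 for real positions, 0 for padded ones) is one common choice, not necessarily the one your notebook uses:

```python
import numpy as np

context_size = 512
phrase_len = 100                         # hypothetical tokenized phrase length
tokens = np.arange(1, phrase_len + 1)    # fake non-zero token ids

# Pad the sequence up to the context size with zeros
padded = np.zeros(context_size, dtype=int)
padded[:phrase_len] = tokens

# A padding mask marks which positions are real (1) vs padded (0),
# so the attention computation can ignore the padded slots
mask = (padded != 0).astype(float)       # 100 ones followed by 412 zeros
```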

So, in the programming assignment, in the scaled_dot_product_attention function, seq_len_q of tensor q (the length of axis -2 of q) is the 'context size'.
Is this correct?

def scaled_dot_product_attention(q, k, v, mask):
    """
    Calculate the attention weights.
    q, k, v must have matching leading dimensions.
    k, v must have a matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
    The mask has different shapes depending on its type (padding or look-ahead),
    but it must be broadcastable for addition.

    Arguments:
        q -- query shape == (..., seq_len_q, depth)
        k -- key shape == (..., seq_len_k, depth)
        v -- value shape == (..., seq_len_v, depth_v)
        mask -- Float tensor with shape broadcastable
                to (..., seq_len_q, seq_len_k). Defaults to None.
    Returns:
        output -- self-attention tensor
        attention_weights -- attention weights
    """
    # START CODE HERE
    matmul_qk = None  # (..., seq_len_q, seq_len_k)

Hi @Faraz, regarding the error you are getting in the notebook, please check whether you are passing all the parameters required by the multi-head attention layer. You may be passing only the query, but remember, we need the query, key, and value.