W4, Exercise 3 - scaled_dot_product_attention: what do the dimensions of q, k, v mean?

I got the code done and it passed, but I still do not understand the dimensions of q, k, v:
q – query shape == (…, seq_len_q, depth)
k – key shape == (…, seq_len_k, depth)
v – value shape == (…, seq_len_v, depth_v)

  1. What are depth and depth_v?
  2. Is seq_len_k the number of words in the sentence?
  3. Is seq_len_q the number of questions to ask? How is it selected? Is it a hyperparameter to tune?
  4. Does the … represent the folds of multi-head computations?

Were you able to find answers to your questions?

Could a mentor shed some light on this question? Having the same issue myself…

Here is an overview of an Encoder process.

You can see the process for multi-head attention including scaled dot product attention at the center of this diagram.

The most important variable is “embedding_dim”, which is equal to the “model dimension” d. The embedding, positional encoding, fully connected, and multi-head attention layers all use this value as part of the input/output shape of their data. (Another key variable is “seq_len”.)
(As of today, Jun 22, there is an incorrect description of “embedding_dim” in the assignment. For example, the shape of the 3rd parameter for encoder_layer_out is described as (batch_size, input_seq_len, fully_connected_dim), but the last dimension should be “embedding_dim”.)
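
To make that concrete, here is a minimal shape-tracing sketch. The sizes are hypothetical and a random tensor stands in for the real sinusoidal positional-encoding table; this is not the assignment's code.

```python
import tensorflow as tf

# Hypothetical sizes, just to trace shapes; not the assignment's exact values.
batch_size, seq_len, vocab_size, embedding_dim = 64, 10, 1000, 512

tokens = tf.random.uniform((batch_size, seq_len), maxval=vocab_size, dtype=tf.int32)

# Embedding: (batch_size, seq_len) -> (batch_size, seq_len, embedding_dim)
x = tf.keras.layers.Embedding(vocab_size, embedding_dim)(tokens)

# Positional encoding is added element-wise, so the shape does not change.
# (A random tensor stands in for the real sinusoidal table here.)
pos_encoding = tf.random.normal((1, seq_len, embedding_dim))
x = x + pos_encoding                      # (batch_size, seq_len, embedding_dim)

# Each encoder sub-layer maps (batch_size, seq_len, embedding_dim) back to
# (batch_size, seq_len, embedding_dim); e.g. the feed-forward block ends with
# a Dense(embedding_dim) projection.
ffn_out = tf.keras.layers.Dense(embedding_dim)(
    tf.keras.layers.Dense(2048, activation="relu")(x))

print(x.shape, ffn_out.shape)             # (64, 10, 512) (64, 10, 512)
```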

The “query size” d_k, which is called “depth” in this assignment, is calculated as follows:

d_k = \frac{d_{model}}{h} = \frac{embedding\_dim}{num\_heads}

where num_heads is the number of heads in the multi-head attention layer.
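
For example, with embedding_dim = 512 and num_heads = 8 (the sizes used in the original Transformer paper), depth = 512 / 8 = 64.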

Answers to questions:

  1. depth and depth_v are the “query size” d_k calculated by the equation above (see the shape sketch after this list).
  2. seq_len_q, seq_len_k, and seq_len_v are essentially the same, and are all equal to seq_len. seq_len is the “number of words” + padding.
  3. seq_len_q is equal to seq_len, i.e. the “number of words” + padding. If you look at Q, each of its seq_len rows is a query vector q^{<i>}. In that sense you could indeed call them “questions to ask”, but their number is determined by the input length; it is not a separate hyperparameter to tune.
  4. The “…” is the same leading dimension as in the inputs we saw in past exercises, I suppose. In my chart above it is a single sample (sentence), which includes multiple words, but of course the system can accept multiple samples at one time, so I suppose “…” represents the number of input sentences (the batch dimension). When the function is called inside a multi-head attention layer, the per-head dimension can be folded into “…” as well, as in the sketch below.
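
Here is a minimal sketch of those shapes, using the standard scaled dot-product formula softmax(QKᵀ/√d_k)V from the “Attention Is All You Need” paper, without a mask. All sizes are hypothetical, and the code is not copied from the assignment.

```python
import tensorflow as tf

# Hypothetical sizes: 2 sentences per batch, 8 heads,
# seq_len = 10 (words + padding), embedding_dim = 512, so depth = 512 // 8 = 64.
batch_size, num_heads, seq_len, depth = 2, 8, 10, 64

# "..." here is (batch_size, num_heads), shared by q, k, and v.
q = tf.random.normal((batch_size, num_heads, seq_len, depth))  # (..., seq_len_q, depth)
k = tf.random.normal((batch_size, num_heads, seq_len, depth))  # (..., seq_len_k, depth)
v = tf.random.normal((batch_size, num_heads, seq_len, depth))  # (..., seq_len_v, depth_v)

# q @ k^T compares every query with every key.
scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(tf.cast(depth, tf.float32))
print(scores.shape)   # (2, 8, 10, 10) = (..., seq_len_q, seq_len_k)

# Softmax over the keys, then take a weighted sum of the values.
weights = tf.nn.softmax(scores, axis=-1)
output = tf.matmul(weights, v)
print(output.shape)   # (2, 8, 10, 64) = (..., seq_len_q, depth_v)
```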

As you can see, this “scaled_dot_product_attention” is not used elsewhere in the assignment, since the same computation is built into Keras’ MultiHeadAttention layer. That makes it difficult for learners to understand the input/output parameters.
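
For reference, here is a small sketch (hypothetical sizes) of the Keras layer doing self-attention; passing return_attention_scores=True exposes the per-head attention weights that its internal scaled dot-product attention produces.

```python
import tensorflow as tf

batch_size, seq_len, embedding_dim, num_heads = 2, 10, 512, 8

x = tf.random.normal((batch_size, seq_len, embedding_dim))

# key_dim is the per-head "depth" = embedding_dim / num_heads.
mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads,
                                         key_dim=embedding_dim // num_heads)

# Self-attention: query, key, and value are all the same tensor.
output, attn_weights = mha(query=x, value=x, key=x,
                           return_attention_scores=True)

print(output.shape)        # (2, 10, 512)   = (batch_size, seq_len, embedding_dim)
print(attn_weights.shape)  # (2, 8, 10, 10) = (batch_size, num_heads, seq_len_q, seq_len_k)
```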

Hope this helps.
