Hello, I’m learning the Transformer architecture, and while doing a Coursera lab related to multi-head attention I don’t really understand the sizes of the query, key, and value. Here is my reasoning:
I’m using the TensorFlow framework, with vectors represented as row vectors. When I compute q = X · W_q, where X has shape (batch_size, sequence, vector_emb), q has shape (batch_size, sequence, dim_q). Similarly, k has shape (batch_size, sequence, dim_k) and v has shape (batch_size, sequence, dim_v).
However, in the assignment dim_q = dim_k = embed_dim at the beginning, while in TensorFlow’s built-in MultiHeadAttention layer dim_k = dim_v. Why?
If dim_q != dim_k, the matrix multiplication that computes the attention scores doesn’t work out.
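Here is a minimal sketch of what I mean (the sizes are made up and W_q, W_k, W_v are just random stand-ins for the learned projection weights, not the assignment’s actual code):

```python
import tensorflow as tf

# Illustrative sizes only.
batch_size, sequence, vector_emb = 2, 5, 16
dim_q = dim_k = 8    # these two must match for the score matmul below
dim_v = 12           # this one can differ

X = tf.random.normal((batch_size, sequence, vector_emb))
W_q = tf.random.normal((vector_emb, dim_q))   # stand-ins for learned projections
W_k = tf.random.normal((vector_emb, dim_k))
W_v = tf.random.normal((vector_emb, dim_v))

q = tf.einsum('bse,ed->bsd', X, W_q)   # (batch_size, sequence, dim_q)
k = tf.einsum('bse,ed->bsd', X, W_k)   # (batch_size, sequence, dim_k)
v = tf.einsum('bse,ed->bsd', X, W_v)   # (batch_size, sequence, dim_v)

# Attention scores: the last dimension of q must equal the last dimension of k.
scores = tf.matmul(q, k, transpose_b=True)   # (batch_size, sequence, sequence)
print(q.shape, k.shape, v.shape, scores.shape)
```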
I may be missing which formula you are asking about, and whether your question is about the lectures or specifically about one section of the assignment, but I think the relevant one is this, from the scaled_dot_product_attention section of the assignment:
Attention(Q, K, V) = softmax(Q · K^T / sqrt(d_k)) · V
We are doing this one sample at a time, so the first of the three dimensions (the batch dimension) vanishes and the remaining dimensions are:
Q is seq_len_q x depth
K is seq_len_k x depth
So the dimensions for the dot product Q · K^T work out, because Q and K share the same last dimension (depth), and you end up with seq_len_q x seq_len_k.
But then you need to take the dot product of that with V, which is seq_len_v x depth_v.
That is why we need seq_len_k == seq_len_v in order for that to work; the final result is seq_len_q x depth_v.
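To make the shape bookkeeping concrete, here is a rough sketch of that per-sample computation (made-up sizes, not the assignment’s exact code):

```python
import tensorflow as tf

# Made-up sizes: Q and K share depth; K and V share seq_len; depth_v is free.
seq_len_q, seq_len_k, seq_len_v = 4, 6, 6
depth, depth_v = 8, 10

Q = tf.random.normal((seq_len_q, depth))
K = tf.random.normal((seq_len_k, depth))
V = tf.random.normal((seq_len_v, depth_v))

dk = tf.cast(depth, tf.float32)
scores = tf.matmul(Q, K, transpose_b=True) / tf.sqrt(dk)   # (seq_len_q, seq_len_k)
weights = tf.nn.softmax(scores, axis=-1)                   # (seq_len_q, seq_len_k)
output = tf.matmul(weights, V)                             # needs seq_len_k == seq_len_v
print(output.shape)                                        # (seq_len_q, depth_v)
```

Note that depth_v does not have to equal depth; the output simply comes out as seq_len_q x depth_v.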
If you are asking about one of the lectures or a different part of the assignment, please let me know. If it’s a lecture, please show a screenshot or give the time offset.