Misinformation regarding using either the Query or the Key dimension to calculate the scale for the attention score

hi @Mubsi,

In Course 3, Week 3, second video, titled "Attention", at 6:20 the instructor mentions that we can use either the query or the key to calculate the scale dimension since both are the same. I believed one needs to use the dimension of the key, as that is what is used to calculate the attention score in the subsequent step of computing the attention weights. In fact, he mentions that Q is used to get the dimension value for the scale, though sometimes you will see K used instead.

So I cross-checked, first in the course video and then in the original Transformer paper ("Attention Is All You Need"), which does state that one needs to use the key dimension to get the scale value, not the query dimension. Nowhere could I find it stated that the query dimension should be used for the scale value instead of the key dimension, or that using the query dimension one would get the same value. So either I am confused about what the instructor @lmoroney says, or this needs to be updated in the course video.
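For reference, Equation (1) in the paper defines scaled dot-product attention as Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where d_k is the dimension of the keys.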

Course video screenshot with transcript:

Original Transformer paper ("Attention Is All You Need") screenshot:

Kindly address this issue, either for me or for all future learners who come across this video.

Regards

DP

You do use the Key dimension, as in the paper. I think it was just an offhand comment that you can use either (in this specific case) because they are the same.

Hello @lmoroney

I already understand that the query and key have the same dimensions, but they differ in their linear projections of the input.

So, based on your response, are you saying the query could be used for getting the attention score, since they have the same dimensions?

Thank you

DP

Queries and keys do come from different linear projections of the input.

That said, for the scaling in scaled dot-product attention, you are only using the size of the feature dimension of the vectors that participate in the dot product. In the standard Transformer, the projected Q and K vectors have the same per-head feature dimension, so that dimension size is identical whether you read it from Q or from K. So if your goal is simply to retrieve the dimension size for the scale factor, it is mathematically equivalent either way.

Canonically, this scale is written using d_k (the key dimension), and you will almost always see it referenced that way. In this lab, the code happens to read that same dimension size from Q instead. Ideally we would label it in the canonical way, but because the value is identical here, it does not change the computation. Laurence calls this out in the video specifically so learners don't get the impression it is a different formula.
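To make the equivalence concrete, here is a minimal NumPy sketch (not the lab's actual code; the projection sizes are illustrative assumptions). The scale is read from K's feature dimension, as in the paper, and the assertion shows why reading it from Q would give the same value in this setup:

```python
# A minimal sketch of scaled dot-product attention (NumPy, illustrative only).
import numpy as np

def scaled_dot_product_attention(q, k, v):
    # Canonical form: scale by sqrt(d_k), the per-head feature size of K.
    d_k = k.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)
    # Softmax over the key positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Q and K come from different linear projections of the input, but both
# project to the same per-head feature size, so reading the scale from
# q.shape[-1] would give the identical value here.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 16))      # (batch, seq_len, d_model), illustrative sizes
w_q = rng.normal(size=(16, 8))       # d_model -> d_k
w_k = rng.normal(size=(16, 8))       # d_model -> d_k (same size, different weights)
w_v = rng.normal(size=(16, 8))
q, k, v = x @ w_q, x @ w_k, x @ w_v

assert q.shape[-1] == k.shape[-1]    # hence sqrt(d_q) == sqrt(d_k)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)                     # (2, 5, 8)
```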
