Misinformation regarding using either the Query or the Key dimension to calculate the scale for the attention score

hi @Mubsi,

In Course 3, Week 3, second video, titled "Attention", at 6:20 the instructor mentions that we can use either the query or the key to calculate the scale dimension since both are the same. I believed one needs to use the dimension of the key, as that is what is used to calculate the attention score in the subsequent step of computing the attention weights. In fact, he mentions that Q is used to get the dimension value for the scale, though sometimes you will see K used instead.

So I cross-checked, first in the course video and then in the original Transformer paper ("Attention Is All You Need"), which does state that one needs to use the key dimension to get the scale value, not the query dimension. Nowhere could I find it stated that the query dimension should be used for the scale value instead of the key dimension, or that using the query dimension one would get the same value. So either I am confused about what the instructor @lmoroney says, or this needs to be updated in the course video.
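For reference, Equation (1) in the paper defines scaled dot-product attention as Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where d_k is the dimension of the keys.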

Course video screenshot with transcript:

Original Transformer paper ("Attention Is All You Need") screenshot:

Kindly address this issue, either for me or for all future learners who come across this video.

Regards

DP

You do use the Key dimension, as in the paper. I think it was just an offhand comment that you can use either (in this specific case) because they are the same.

Hello @lmoroney

I already understand that the query and key have the same dimensions, but they differ in their linear projections of the input.

So, based on your response, are you saying the query could be used for getting the attention score, since they have the same dimensions?

Thank you

DP

Queries and keys do come from different linear projections of the input.

That said, for the scaling in scaled dot-product attention, you are only using the size of the feature dimension of the vectors that participate in the dot product. In the standard Transformer, the projected Q and K vectors have the same per-head feature dimension, so that dimension size is identical whether you read it from Q or from K. So if your goal is simply to retrieve the dimension size for the scale factor, it is mathematically equivalent either way.

Canonically, this scale is written using d_k (the key dimension), and you will almost always see it referenced that way. In this lab, the code happens to read that same dimension size from Q instead. Ideally we would label it in the canonical way, but because the value is identical here, it does not change the computation. Laurence calls this out in the video specifically so learners don't get the impression it is a different formula.
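To make the equivalence concrete, here is a minimal NumPy sketch (not the lab's actual code; the projection sizes are illustrative assumptions). The scale is read from K's feature dimension, as in the paper, and the assertion shows why reading it from Q would give the same value in this setup:

```python
# A minimal sketch of scaled dot-product attention (NumPy, illustrative only).
import numpy as np

def scaled_dot_product_attention(q, k, v):
    # Canonical form: scale by sqrt(d_k), the per-head feature size of K.
    d_k = k.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)
    # Softmax over the key positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Q and K come from different linear projections of the input, but both
# project to the same per-head feature size, so reading the scale from
# q.shape[-1] would give the identical value here.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 16))      # (batch, seq_len, d_model), illustrative sizes
w_q = rng.normal(size=(16, 8))       # d_model -> d_k
w_k = rng.normal(size=(16, 8))       # d_model -> d_k (same size, different weights)
w_v = rng.normal(size=(16, 8))
q, k, v = x @ w_q, x @ w_k, x @ w_v

assert q.shape[-1] == k.shape[-1]    # hence sqrt(d_q) == sqrt(d_k)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)                     # (2, 5, 8)
```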
