What is the rationale behind square root scaling in attention

In scaled dot-product attention, the dot product of Q and K^T is divided by the square root of d_k.

  1. What is the rationale for choosing d_k as the scaling factor?
  2. Why is its square root used for scaling?
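
For reference, this is the formula from "Attention Is All You Need":

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$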

EDIT:

I found an explanation here of why the scaling involves d_k, but not why its square root is used:

Hi @Ritu_Pande

That is a good question :+1:

The dot product of Q and K^T increases the variance because each score is a sum of d_k products of random numbers, and the variances of those products add up.

One of the best explanations I have seen is by Andrej Karpathy here, in the span of about 3 minutes (note: `wei` there is equal to Q \cdot K^T).

I would encourage you to try it and see for yourself, for example:
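Here is a minimal sketch (assuming PyTorch; the tensor shapes are just illustrative) that compares the variance of `wei` before and after dividing by \sqrt{d_k}:

```python
import torch

torch.manual_seed(0)

B, T, d_k = 4, 8, 64   # batch size, sequence length, head dimension (illustrative)

# Zero-mean, unit-variance Q and K, matching the assumption in the variance argument
q = torch.randn(B, T, d_k)
k = torch.randn(B, T, d_k)

wei = q @ k.transpose(-2, -1)       # raw attention scores; variance grows with d_k
wei_scaled = wei / d_k ** 0.5       # scaled scores; variance back near 1

print(wei.var().item())             # roughly d_k = 64
print(wei_scaled.var().item())      # roughly 1

# Without scaling, softmax over the high-variance scores collapses toward one-hot,
# which starves most positions of gradient
print(torch.softmax(wei[0, 0], dim=-1))
print(torch.softmax(wei_scaled[0, 0], dim=-1))
```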

Cheers

Thank you for the reference and the details. After going through the video you shared, I also found a theoretical proof (here) of how the dot product of two random vectors of length d_k, each with components of mean 0 and variance 1, results in variance d_k.

Therefore, we have to divide by the standard deviation, \sqrt{d_k}, to scale the distribution back down to unit variance.
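
For anyone reading later, a short sketch of that argument (assuming the components q_i and k_i are independent, with mean 0 and variance 1):

$$
\operatorname{Var}(q \cdot k)
  = \operatorname{Var}\!\left(\sum_{i=1}^{d_k} q_i k_i\right)
  = \sum_{i=1}^{d_k} \operatorname{Var}(q_i)\,\operatorname{Var}(k_i)
  = d_k ,
$$

so the standard deviation of the scores is \sqrt{d_k}, and dividing by \sqrt{d_k} brings them back to unit variance.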

Thanks for all the help you have been giving me in understanding these concepts.
