What is the rationale behind square root scaling in attention

The scaled dot product of Q and K is divided by square root of d.

  1. What is the rationale for choosing d as the scaling factor?
  2. Why is its square root used for scaling?
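
For reference, the operation the questions refer to, as defined in "Attention Is All You Need":

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the key (and query) vectors.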


I found an explanation here of why scaling by d is done, but not why the square root is used:

Hi @Ritu_Pande

That is a good question :+1:

The dot product of Q and K^T inflates the variance: each entry is a sum of d_k products of random numbers, and the variances of those products add up.

One of the best explanations I have seen is by Andrej Karpathy, here, in the span of about 3 minutes (note: wei in the video is equal to Q \cdot K^T).

I would encourage you to try it and see for yourself, like:
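A minimal numpy sketch of the experiment (the sample count and d_k = 512 are my own choices, not from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512

# 10,000 query/key pairs; entries drawn i.i.d. with mean 0 and variance 1
q = rng.standard_normal((10000, d_k))
k = rng.standard_normal((10000, d_k))

dots = (q * k).sum(axis=1)       # unscaled dot products
scaled = dots / np.sqrt(d_k)     # scaled as in attention

print(dots.var())    # ~ d_k (about 512)
print(scaled.var())  # ~ 1
```

Without the division, the variance of the logits grows linearly with d_k, which pushes the softmax into a near one-hot regime with tiny gradients; dividing by the square root restores unit variance.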


Thank you for the reference and the details. After going through the video you shared, I also found a theoretical proof ( here ) of how the dot product of two random vectors of length d_k, each entry with variance 1 and mean 0, results in variance d_k.

Therefore, we divide by the standard deviation, \sqrt{d_k}, to scale the distribution back to unit variance.
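For reference, the one-line version of that proof: with the entries $q_i, k_i$ independent, mean 0, and variance 1,

$$\mathrm{Var}(q \cdot k) = \mathrm{Var}\!\left(\sum_{i=1}^{d_k} q_i k_i\right) = \sum_{i=1}^{d_k} \mathrm{Var}(q_i)\,\mathrm{Var}(k_i) = d_k,$$

so the standard deviation of the dot product is $\sqrt{d_k}$, which is exactly the factor divided out.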

Thanks for all the help you have been giving me in understanding these concepts.