Purpose of sqrt(dim(k)) in Scaled dot product attention

Matt_Stults · November 11, 2021, 6:29am

I’m looking at Scaled Dot-Product Attention: Ungraded Lab

I think I understand the reason for all of the operations in Attention(Q,K,V) except for the division by the sqrt(dimension of embeddings in K).

The sqrt(dimension of embeddings in K) division seems to reduce the magnitude of difference for each comparison going into the softmax, and so I do understand that the division impacts the resulting weighting. But why division by this specific quantity? It would make some intuitive sense (though I still wouldn’t understand why it was necessary) if we were dividing by the dimension of the embeddings in K without the sqrt, because each value in the QK^T matrix is the sum of that many multiplications. But the sqrt makes me think I’m going in the wrong direction with this line of thinking.

Thanks in advance for your patience.

darivadi · November 17, 2021, 3:06pm

Hi @Matt_Stults.

Regarding sqrt(dim_k), you can refer to the paper Attention Is All You Need, where the authors compare additive attention with dot-product attention. Basically, additive attention outperforms dot-product attention when no scaling factor is used and dim_k is large. They also mention that for large values of dim_k the dot products can grow large in magnitude, pushing the softmax into regions where its gradients are extremely small. So dividing by sqrt(dim_k) can be viewed as a way to avoid the vanishing gradient problem.
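To make the effect concrete, here is a small NumPy sketch (not from the lab or the paper). It assumes query and key components are independent standard normal values, which is an idealization, and the helper name `score_stats` is made up for this illustration. It measures how spread out the raw dot products are and how peaked the resulting softmax weights become, with and without the division.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def score_stats(dim_k, n_queries=2000, n_keys=8, seed=0):
    """Hypothetical helper: compare raw and scaled attention scores.

    Assumption: q and k entries are i.i.d. standard normal. Real learned
    projections will not match this exactly.
    """
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_queries, dim_k))
    k = rng.standard_normal((n_queries, n_keys, dim_k))

    # One dot product per (query, key) pair.
    raw = np.einsum("qd,qnd->qn", q, k)
    scaled = raw / np.sqrt(dim_k)

    return {
        "raw_std": raw.std(),
        "scaled_std": scaled.std(),
        # Average of the largest softmax weight per query: values near 1
        # mean the softmax is saturated (almost one-hot).
        "raw_peak": softmax(raw).max(axis=-1).mean(),
        "scaled_peak": softmax(scaled).max(axis=-1).mean(),
    }

for dim_k in (4, 64, 256):
    s = score_stats(dim_k)
    print(
        f"dim_k={dim_k:4d}  raw std={s['raw_std']:6.2f}  "
        f"scaled std={s['scaled_std']:5.2f}  "
        f"raw peak={s['raw_peak']:.2f}  scaled peak={s['scaled_peak']:.2f}"
    )
```

Under these assumptions the raw standard deviation should track sqrt(dim_k) while the scaled one stays near 1, and the unscaled softmax should become increasingly one-hot as dim_k grows. A nearly one-hot softmax is the regime where its gradient is close to zero.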

Hope it helps.

Matt_Stults · November 19, 2021, 3:09am

Thanks! I had been thinking of this like a typical mathematical formula where each piece has a deep and specific meaning about the relationship of the parts. For example, I was looking for some deeper meaning to the fact that it’s a square root vs a cube root. My interpretation from your answer is that this is more of a pragmatic choice: sqrt(dim_k) just works better in practice than the alternatives tried. Does this sound correct or am I likely still just missing the meaning of this term?

darivadi · November 19, 2021, 2:40pm

Hi @Matt_Stults. You are right. The choice of sqrt(dim_k) is largely pragmatic: it keeps the argument of the softmax from growing so large that the softmax saturates, which is the region where its gradient may vanish.
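The square root in particular does have a statistical motivation, which the paper gives in a footnote. The sketch below assumes the components of a query q and a key k are independent random variables with mean 0 and variance 1. That is an idealization, and d_k here is just the paper's symbol for dim_k.

```latex
\begin{aligned}
q \cdot k &= \sum_{i=1}^{d_k} q_i k_i, \\
\mathbb{E}[q_i k_i] &= \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0, \qquad
\operatorname{Var}(q_i k_i) = \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = 1, \\
\operatorname{Var}(q \cdot k) &= \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k, \\
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) &= \frac{d_k}{d_k} = 1 .
\end{aligned}
```

So the standard deviation of a raw score grows like sqrt(d_k), and dividing by exactly that amount keeps the spread of the softmax inputs roughly constant as d_k changes. Under the same assumptions, dividing by d_k itself would shrink the standard deviation to 1/sqrt(d_k) and flatten the softmax toward uniform weights, while a cube root would leave a standard deviation of d_k^(1/6) that still grows with the dimension.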
