Why do we use square root of key dimension for scaling?

Hello Folks,

I have a query about part of the attention equation. The attention scores are computed as a dot product and then scaled by dividing by the square root of the key dimension before the softmax is applied. Why can't we use cosine similarity instead, which would also give us a scaled dot product? Am I missing something here?

Thanks,
RR.

Why not use cosine?
The attention mechanism computes attention scores using the following formula:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
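
For concreteness, here is a minimal NumPy sketch of that formula (the function name, shapes, and toy inputs are my own, not from any particular library):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]                               # key dimension
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_queries, n_keys) scaled similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted sum of the value vectors

# toy shapes: 4 queries, 6 keys/values, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```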

So, why not use cosine similarity?
Cosine similarity measures the angle between two vectors, defined as:

\text{cosine\_similarity}(Q, K) = \frac{Q \cdot K}{\|Q\| \|K\|}
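
In code, that normalization might look like the following (toy vectors I picked for illustration). Note how a key that is just a scaled-up copy of another key gets exactly the same cosine score, which is the magnitude issue discussed below:

```python
import numpy as np

def cosine_similarity(q, k):
    """Angle-based similarity: dot product normalized by both norms."""
    return q @ k / (np.linalg.norm(q) * np.linalg.norm(k))

q  = np.array([1.0, 2.0, 3.0])
k1 = np.array([2.0, 0.0, 1.0])
k2 = 10.0 * k1                       # same direction, ten times the norm

print(cosine_similarity(q, k1))      # ≈ 0.598
print(cosine_similarity(q, k2))      # identical: the norm is divided out
print(q @ k1, q @ k2)                # raw dot products still differ: 5.0 vs 50.0
```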

At first glance, cosine similarity might seem like a natural choice because it normalizes the dot product by the magnitudes of the vectors, ensuring that the similarity score lies in the range [−1, 1]. However, there are several reasons why cosine similarity is not typically used in attention mechanisms:

  • Magnitude information: the raw dot product keeps the norms of the query and key vectors, which carry useful signal that cosine similarity throws away.
  • Softmax requires unbounded inputs: cosine similarity confines every score to [−1, 1], which limits how peaked the softmax distribution can become (see the sketch after this list).
  • Computational efficiency: the scaled dot product needs only a matrix multiplication and a single constant scaling factor, while cosine similarity adds per-vector norms and divisions.
  • Empirical evidence: scaled dot-product attention has worked well across a wide range of models and tasks.
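
As a rough illustration of the bounded-softmax point (the numbers below are toy values I chose, not anything from a paper): cosine scores are confined to [−1, 1], so the softmax over many keys cannot become very peaked, whereas scaled dot products are free to grow and can focus the attention sharply.

```python
import numpy as np

def softmax(x):
    x = x - x.max()                  # numerical stability
    e = np.exp(x)
    return e / e.sum()

n_keys = 100
cosine_logits = np.full(n_keys, -1.0)
cosine_logits[0] = 1.0               # best possible cosine separation: 1 vs -1
dot_logits = np.zeros(n_keys)
dot_logits[0] = 10.0                 # scaled dot products are free to grow

print(softmax(cosine_logits)[0])     # ≈ 0.07: attention stays spread out
print(softmax(dot_logits)[0])        # ≈ 0.995: attention can focus sharply
```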

To summarize:

  • The scaled dot product is preferred in attention mechanisms because it preserves magnitude information, provides unbounded inputs for softmax, and is computationally efficient.
  • Cosine similarity, while theoretically appealing, loses magnitude information, constrains the input range for softmax, and introduces additional computational overhead.
  • Empirical evidence supports the effectiveness of the scaled dot product in a wide range of applications.

In short, cosine similarity is not used in attention mechanisms because it discards magnitude information, constrains the softmax inputs, and adds computational overhead.

Thank you Carlos! That helps!

I think the question is still unanswered. OK, don't use cosine similarity, so that the computation stays simple and magnitude information is preserved. However, you could achieve that without any scaling at all. Why then divide by the constant \sqrt{d_k}?

A copy-paste from ChatGPT in response to this specific question:

Edit: Even better than my previous answer (which I can’t edit), directly from [the paper](https://arxiv.org/pdf/1706.03762), at the end of section 3.2.1:

We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by 1/\sqrt{d_k}.
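
As a quick numerical check of that claim (a toy simulation of my own, using the paper's assumption that the components of q and k are independent with mean 0 and variance 1): the variance of the raw dot product grows linearly with d_k, and dividing by \sqrt{d_k} brings it back to roughly 1 regardless of the dimension, which keeps the softmax inputs in a reasonable range.

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256, 1024):
    q = rng.normal(size=(10_000, d_k))    # components ~ N(0, 1)
    k = rng.normal(size=(10_000, d_k))
    dots = (q * k).sum(axis=1)            # raw dot products q · k
    print(d_k, dots.var().round(1), (dots / np.sqrt(d_k)).var().round(2))
    # variance of q·k ≈ d_k; after dividing by sqrt(d_k) it is ≈ 1
```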
