Why do we use square root of key dimension for scaling?

Hello Folks,

I have a query about part of the attention equation. The attention scores are computed as a dot product and then scaled by dividing by the square root of the key dimension before the softmax is applied. Why can't we use cosine similarity instead, which would also give us a scaled dot product? Am I missing something here?

Thanks,
RR.

Why not use cosine?
The attention mechanism computes attention scores using the following formula:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
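
For concreteness, here is a minimal NumPy sketch of that formula (the function name, shapes, and toy inputs are my own, not from any particular library):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]                               # key dimension
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_queries, n_keys) scaled similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted sum of the value vectors

# toy shapes: 4 queries, 6 keys/values, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```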

So, why not use cosine similarity?
Cosine similarity measures the angle between two vectors, defined as:

\text{cosine\_similarity}(Q, K) = \frac{Q \cdot K}{\|Q\| \|K\|}
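
In code, that normalization might look like the following (toy vectors I picked for illustration). Note how a key that is just a scaled-up copy of another key gets exactly the same cosine score, which is the magnitude issue discussed below:

```python
import numpy as np

def cosine_similarity(q, k):
    """Angle-based similarity: dot product normalized by both norms."""
    return q @ k / (np.linalg.norm(q) * np.linalg.norm(k))

q  = np.array([1.0, 2.0, 3.0])
k1 = np.array([2.0, 0.0, 1.0])
k2 = 10.0 * k1                       # same direction, ten times the norm

print(cosine_similarity(q, k1))      # ≈ 0.598
print(cosine_similarity(q, k2))      # identical: the norm is divided out
print(q @ k1, q @ k2)                # raw dot products still differ: 5.0 vs 50.0
```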

At first glance, cosine similarity might seem like a natural choice because it normalizes the dot product by the magnitudes of the vectors, ensuring that the similarity score lies in the range [−1, 1]. However, there are several reasons why cosine similarity is not typically used in attention mechanisms:

  • Magnitude information: the raw dot product keeps the norms of the query and key vectors, which carry useful signal that cosine similarity throws away.
  • Softmax requires unbounded inputs: cosine similarity confines every score to [−1, 1], which limits how peaked the softmax distribution can become (see the sketch after this list).
  • Computational efficiency: the scaled dot product needs only a matrix multiplication and a single constant scaling factor, while cosine similarity adds per-vector norms and divisions.
  • Empirical evidence: scaled dot-product attention has worked well across a wide range of models and tasks.
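
As a rough illustration of the bounded-softmax point (the numbers below are toy values I chose, not anything from a paper): cosine scores are confined to [−1, 1], so the softmax over many keys cannot become very peaked, whereas scaled dot products are free to grow and can focus the attention sharply.

```python
import numpy as np

def softmax(x):
    x = x - x.max()                  # numerical stability
    e = np.exp(x)
    return e / e.sum()

n_keys = 100
cosine_logits = np.full(n_keys, -1.0)
cosine_logits[0] = 1.0               # best possible cosine separation: 1 vs -1
dot_logits = np.zeros(n_keys)
dot_logits[0] = 10.0                 # scaled dot products are free to grow

print(softmax(cosine_logits)[0])     # ≈ 0.07: attention stays spread out
print(softmax(dot_logits)[0])        # ≈ 0.995: attention can focus sharply
```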

To summarize:

  • The scaled dot product is preferred in attention mechanisms because it preserves magnitude information, provides unbounded inputs for softmax, and is computationally efficient.
  • Cosine similarity, while theoretically appealing, loses magnitude information, constrains the input range for softmax, and introduces additional computational overhead.
  • Empirical evidence supports the effectiveness of the scaled dot product in a wide range of applications.

In short, cosine similarity is not used in attention mechanisms because it discards magnitude information, constrains the softmax inputs, and adds computational overhead.

Thank you Carlos! That helps!

I think the question is still unanswered. OK, don't use cosine similarity, so that the computation stays simple and magnitude information is preserved. However, you could achieve that without any scaling at all. Why then divide by the constant \sqrt{d_k}?

A copy-paste from ChatGPT in response to this specific question:

Edit: Even better than my previous answer (which I can’t edit), directly from [the paper](https://arxiv.org/pdf/1706.03762), at the end of section 3.2.1:

We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by 1/\sqrt{d_k}.
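
As a quick numerical check of that claim (a toy simulation of my own, using the paper's assumption that the components of q and k are independent with mean 0 and variance 1): the variance of the raw dot product grows linearly with d_k, and dividing by \sqrt{d_k} brings it back to roughly 1 regardless of the dimension, which keeps the softmax inputs in a reasonable range.

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256, 1024):
    q = rng.normal(size=(10_000, d_k))    # components ~ N(0, 1)
    k = rng.normal(size=(10_000, d_k))
    dots = (q * k).sum(axis=1)            # raw dot products q · k
    print(d_k, dots.var().round(1), (dots / np.sqrt(d_k)).var().round(2))
    # variance of q·k ≈ d_k; after dividing by sqrt(d_k) it is ≈ 1
```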
