Why do we use the square root of the key dimension for scaling?

Hello Folks,

I have a query about part of the attention equation. The attention scores are computed as a dot product and then scaled by dividing by the square root of the dimension of K, before the softmax is applied. Why can't we use cosine similarity instead, which would already give us a normalized dot product? Am I missing something here?

Thanks,
RR.

Why not use cosine?
The attention mechanism computes its output using the following formula:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
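
For reference, here is a minimal single-head NumPy sketch of that formula; the function name, shapes, and random seed are just illustrative, not from the thread:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the exponentials
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted average of the value rows

# Toy shapes: 4 queries, 6 keys/values, d_k = d_v = 8 (all arbitrary)
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # -> (4, 8)
```

A real implementation would also handle batching, masking, and multiple heads, but the core computation is just these few lines.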

So, why not use cosine similarity?
Cosine similarity measures the cosine of the angle between two vectors and is defined as:

\text{cosine\_similarity}(Q, K) = \frac{Q \cdot K}{\|Q\| \|K\|}
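
To make the comparison concrete, this is roughly what a cosine-similarity score matrix would look like in place of QK^T / sqrt(d_k); the function name, eps term, and shapes here are my own illustrative choices:

```python
import numpy as np

def cosine_similarity_scores(Q, K, eps=1e-8):
    """(Q . K) / (||Q|| ||K||) for every query-key pair; every entry lies in [-1, 1]."""
    Qn = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + eps)
    Kn = K / (np.linalg.norm(K, axis=-1, keepdims=True) + eps)
    return Qn @ Kn.T

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 64)), rng.normal(size=(6, 64))
scores = cosine_similarity_scores(Q, K)
print(scores.min(), scores.max())   # bounded to [-1, 1] no matter how large d_k is
```

The only change from the scaled dot product is that each row of Q and K is normalized to unit length first, which is exactly where the magnitude information is lost.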

At first glance, cosine similarity might seem like a natural choice because it normalizes the dot product by the magnitudes of the vectors, ensuring that every similarity score lies in the range [−1, 1]. However, there are several reasons why cosine similarity is not typically used in attention mechanisms:

  • Magnitude information: the dot product keeps the lengths of the query and key vectors, which the model can learn to exploit; cosine similarity throws that information away.
  • Unbounded softmax inputs: squashing every score into [−1, 1] pushes the softmax toward a near-uniform distribution, so attention cannot focus sharply (see the sketch after this list).
  • Computational efficiency: dividing by the single scalar √d_k is cheaper than computing and dividing by ‖Q‖‖K‖ for every query-key pair.
  • Empirical evidence: scaled dot-product attention has proven effective across a wide range of models and tasks.
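
A quick numeric sketch of the scaling and softmax points (random Gaussian vectors and toy shapes, purely illustrative): raw dot products of d_k-dimensional vectors have a standard deviation near √d_k and saturate the softmax, cosine scores are squashed into [−1, 1] and give nearly uniform attention weights, while the √d_k-scaled scores keep the softmax in a useful operating range.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k = 64
q = rng.normal(size=d_k)            # one query
K = rng.normal(size=(10, d_k))      # ten keys

raw    = K @ q                                                  # std grows like sqrt(d_k)
scaled = raw / np.sqrt(d_k)                                     # back to roughly unit scale
cosine = raw / (np.linalg.norm(K, axis=1) * np.linalg.norm(q))  # squashed into [-1, 1]

for name, s in [("raw dot", raw), ("scaled dot", scaled), ("cosine", cosine)]:
    w = softmax(s)
    print(f"{name:10s} score std = {s.std():5.2f}   max attention weight = {w.max():.3f}")

# Typical output: raw dot products saturate the softmax onto a single key,
# cosine scores give nearly uniform weights (about 1/10 each),
# and the sqrt(d_k)-scaled scores keep a useful amount of contrast.
```

The printed maximum attention weight is the simplest way to see the effect: near 1.0 means the softmax has collapsed onto one key, near 0.1 means it is essentially uniform over the ten keys.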

To summarize:

  • The scaled dot product is preferred in attention mechanisms because it preserves magnitude information, provides unbounded inputs for softmax, and is computationally efficient.
  • Cosine similarity, while theoretically appealing, loses magnitude information, constrains the input range for softmax, and introduces additional computational overhead.
  • Empirical evidence supports the effectiveness of the scaled dot product in a wide range of applications.

In short, cosine similarity is avoided in attention because it discards magnitude information, constrains the softmax inputs, and adds computational overhead. As for the √d_k itself: the dot product of two d_k-dimensional vectors with roughly unit-variance components has a variance that grows with d_k, so dividing by √d_k keeps the scores at a stable scale and stops the softmax from saturating as the key dimension grows.

Thank you Carlos! That helps!