I have a question about part of the attention equation. The attention scores are computed as the dot product of the queries and keys, scaled by dividing by the square root of the key dimension, and only then passed through the softmax. Why can't we use cosine similarity instead, which would also give us a scaled dot product? Am I missing something here?
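For reference, scaled dot-product attention as defined in the original Transformer paper is

$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

and the alternative you are describing would replace each raw score $q \cdot k$ with the cosine similarity $\frac{q \cdot k}{\lVert q \rVert \, \lVert k \rVert}$.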
At first glance, cosine similarity might seem like a natural choice because it normalizes the dot product by the magnitudes of the vectors, ensuring that the similarity score lies in the range [−1, 1]. However, there are several reasons why cosine similarity is not typically used in attention mechanisms:
1. Magnitude information. The raw dot product q · k grows with the norms of q and k, so the model can learn to make some keys systematically more or less influential through their magnitude. Cosine similarity normalizes both vectors to unit length and discards that signal (see the sketch after this list).
2. Softmax benefits from unbounded inputs. Cosine similarity confines every score to [−1, 1], which caps how peaked the softmax distribution can become. Scaled dot products can grow as large as the model needs, so attention can become sharp, almost one-hot, when a single key is clearly the best match.
3. Computational efficiency. The scaled dot product is a single matrix multiplication followed by a constant scaling by 1/√d_k. Cosine similarity additionally requires computing the norm of every query and key and dividing by their product.
4. Empirical evidence. Scaled dot-product attention has proven effective across a very wide range of Transformer models and tasks, so there has been little practical pressure to replace it.
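To make the magnitude point concrete, here is a minimal NumPy sketch (the dimension, the random vectors, and the 4x rescaling of one key are illustrative assumptions, not part of any real model). It compares the softmax weights produced by scaled dot-product scores and by cosine-similarity scores when one key's magnitude changes:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64                        # key/query dimension (illustrative)
q = rng.normal(size=d_k)        # one query vector
K = rng.normal(size=(5, d_k))   # five key vectors

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def scaled_dot_scores(q, K):
    # Raw dot products scaled by 1/sqrt(d_k), as in standard attention.
    return K @ q / np.sqrt(d_k)

def cosine_scores(q, K):
    # Dot products normalized by the vector magnitudes; always in [-1, 1].
    return (K @ q) / (np.linalg.norm(K, axis=1) * np.linalg.norm(q))

K_boosted = K.copy()
K_boosted[0] *= 4.0             # quadruple the magnitude of the first key

# Scaled dot-product weights respond to the change in magnitude...
print(softmax(scaled_dot_scores(q, K)))
print(softmax(scaled_dot_scores(q, K_boosted)))

# ...while cosine-similarity weights are identical before and after,
# because cosine similarity is invariant to vector scale.
print(softmax(cosine_scores(q, K)))
print(softmax(cosine_scores(q, K_boosted)))
```

Rescaling K[0] changes the scaled dot-product weights but leaves the cosine-similarity weights untouched, and because cosine scores never leave [−1, 1], the softmax over them can never get close to a one-hot distribution.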
To summarize:
The scaled dot product is preferred in attention mechanisms because it preserves magnitude information, provides unbounded inputs for softmax, and is computationally efficient.
Cosine similarity, while theoretically appealing, loses magnitude information, constrains the input range for softmax, and introduces additional computational overhead.
Empirical evidence supports the effectiveness of the scaled dot product in a wide range of applications.