Question about Embeddings and Cosine Similarity

Hello,

When using cosine similarity, the vectors are effectively normalized: cos(a, b) = dot(a, b) / (norm(a) * norm(b))
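For concreteness, this is how I compute it (a quick NumPy sketch; the function name is just mine):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(a, b) = dot(a, b) / (norm(a) * norm(b))
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```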

1/ It means that basically all embeddings end up on an arc with radius 1, so the only thing that matters is proximity along that arc. So this kind of proximity (the one in the picture, where two points are close in space but at different distances from the origin), unless materialized through a projection onto the arc, is not captured.
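For example (a quick NumPy check with made-up numbers), two vectors pointing in the same direction but with very different norms still get similarity 1:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = 10.0 * a  # same direction, 10x the magnitude

cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos)  # 1.0; the radial distance between a and b is invisible
```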

2/ Regarding separation planes: from what I see in the course, all planes pass through the origin (0, 0), which means you can’t have a “triangle” region like the one highlighted in the course.
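To make what I mean concrete (a quick NumPy sketch; the numbers are made up):

```python
import numpy as np

w = np.array([1.0, -1.0])
x = np.array([2.0, 1.0])

# A plane through the origin classifies by sign(w @ x). Scaling x never
# flips the sign, so every decision region is an unbounded cone through
# (0, 0) and no bounded "triangle" region seems possible.
print(np.sign(w @ x), np.sign(w @ (100 * x)))          # same sign

# With a bias term, the plane w @ x + b = 0 no longer passes through
# the origin, and intersecting several such planes can bound a region.
b = -3.0
print(np.sign(w @ x + b), np.sign(w @ (100 * x) + b))  # signs differ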

I am aware that I am probably confusing something, but it is not at all clear to me where.

3/ If cosine similarity is used for comparing embeddings, why aren’t the embeddings normalized from the get-go? How / when would you use the fact that embeddings are not normalized?
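In other words, why not do something like this once, up front (a sketch; the matrix shape is invented):

```python
import numpy as np

E = np.random.randn(10_000, 512)  # toy embedding matrix, made-up shape
E_unit = E / np.linalg.norm(E, axis=1, keepdims=True)

# After row-normalization, cosine similarity reduces to a dot product:
scores = E_unit @ E_unit[0]  # similarity of every embedding to row 0
```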

Thank you for clarifying! :slight_smile:

Hi @Alex_Gris

Yes, this kind of distribution of datapoints (like in the picture) is usually not what you see. It is common in “plain old” linear regression but not in large language models: especially in high dimensions, all features are usually normalized. Some more interesting content on the matter.
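One common form of this normalization inside transformer layers is layer normalization; here is a bare-bones sketch (omitting the learned gain and bias):

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    # Rescale each feature vector to zero mean and unit variance
    # along its last axis, as done throughout transformer blocks.
    mean = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```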

Yes, usually this is not the way “separation” happens. It is usually done with rotation matrices (see the figure from the Reformer paper).
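Roughly, vectors get hashed by their angle using one shared random rotation; a minimal sketch of that scheme (the dimension and bucket count are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_buckets = 64, 8

# One shared random projection defines the hash, as in the Reformer's
# angular LSH: h(x) = argmax([xR ; -xR]).
R = rng.standard_normal((d, n_buckets // 2))

def lsh_bucket(x: np.ndarray) -> int:
    xR = x @ R
    return int(np.argmax(np.concatenate([xR, -xR])))

x = rng.standard_normal(d)
print(lsh_bucket(x), lsh_bucket(2.0 * x))  # same bucket: only the angle matters
```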

They are normalized from the get-go (the weights are initialized from a normal distribution with mean 0 and variance 1).
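For example (a PyTorch sketch; the vocabulary size and dimension are invented), `torch.nn.Embedding` draws its initial weights from exactly that distribution:

```python
import torch.nn as nn

emb = nn.Embedding(30_000, 512)  # made-up vocab size and dimension

# By default, PyTorch fills these weights from N(0, 1),
# i.e. mean 0 and variance 1.
print(emb.weight.mean().item(), emb.weight.var().item())  # ≈ 0, ≈ 1
```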

Cheers