Why does a simple matmul of embedding vectors describe their similarity?

It was, like yours, just a toy example. This is valid for any matrix. If you compare them for similarity, they need to have the same magnitude. Usually I would expect a magnitude of 1.

I don’t know about your example above, which has a magnitude of 0.8. Probably that is because you only show the first 5 values; it’s likely the full vector has a magnitude of 1.

In my toy example, both vectors had a magnitude of 2.

Matmul is just a bunch of dot products, but the vectors should be normalized beforehand for matmul to give us cosine similarity, which, from the above screenshot, hasn’t been done.
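For example, a quick numpy sketch with made-up vectors shows the difference between a raw dot product and a dot product of normalized vectors (cosine similarity):

import numpy as np

# made-up vectors pointing in the same direction but with different magnitudes
a = np.array([3.0, 4.0])    # magnitude 5
b = np.array([6.0, 8.0])    # magnitude 10

a @ b                                                # 50.0 -- raw dot product, depends on magnitudes
(a / np.linalg.norm(a)) @ (b / np.linalg.norm(b))    # 1.0  -- cosine similarity: identical directions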

:thinking:

1 Like

I haven’t done the course yet, so forgive me if I understand it wrong.

Basically, we are discussing how close the vectors are.
Cosine similarity is much simpler to understand when the vectors’ magnitudes are equal to 1.
In this case, the dot product
∣a∣ * ∣b∣ * cos(𝜃)
reduces to cos(𝜃), the cosine similarity, which is, magically, exactly what the matrix product gives us.

Let’s take 2 vectors in the two-dimensional x, y plane:

A = [x1, y1] at an angle of 𝜃1 from the horizontal

B = [x2, y2] at an angle of 𝜃2 from the horizontal

if they are normalized, the cosine similarity or dot product between them will be

∣A∣ * ∣B∣ * cos(𝜃3)

where 𝜃3 = 𝜃2 - 𝜃1

and ∣A∣ = ∣B∣ = 1

but

cos(𝜃2 - 𝜃1) = cos(𝜃2) * cos(𝜃1) + sin(𝜃2) * sin(𝜃1)

so, applying it to the definitions

cos(𝜃_{1}) = \frac{x_{1}}{|A|} = x_{1}
sin(𝜃_{1}) = \frac{y_{1}}{|A|} = y_{1}

we get that the cosine similarity will be:

x_{1} * x_{2} + y_{1} * y_{2}

As
A = [x1, y1]
and
B = [x2, y2]

Then the cosine similarity or dot product will be

A * B

We can see this also from the definition of cosine similarity:

CS = \frac{A * B}{|A| * |B|}

If the vectors are not normalized, the calculation gets much more complicated:

CS=\frac {x_{1}*x_{2}+y_{1}*y_{2}}{\sqrt{x_{1}^{2}+y_{1}^{2}}*\sqrt{x_{2}^{2}+y_{2}^{2}}}
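A quick numerical check of the derivation above (arbitrary angles, with unit vectors built from them):

import numpy as np

theta1, theta2 = 0.3, 1.1                        # arbitrary angles in radians
A = np.array([np.cos(theta1), np.sin(theta1)])   # unit vector at angle theta1
B = np.array([np.cos(theta2), np.sin(theta2)])   # unit vector at angle theta2

A @ B                       # x1*x2 + y1*y2
np.cos(theta2 - theta1)     # cos of the angle between them -- the same value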

I don’t quite understand: where do you see a magnitude of 0.8?

Honestly, I also don’t understand how it should work with floating-point numbers from -1 to 1.

If I understand correctly, your whole example below is about inputs that are just -1, 0, and 1, isn’t it? Without floats?

The math works exactly the same with floats.

The values will be between -1.0 and 1.0 because they are normalized. (Any value higher than 1 would already mean a magnitude higher than 1)

This:

[screenshot omitted: a step-by-step example (steps 1–7) going from two embedding matrices Q and A, through normalization, to the matmul written out as dot products and the resulting similarity scores]
Numbers in step 1 don’t have to be between -1 and 1.

Step 2 is apparently missing from the code in the screenshots that you have shared. After step 2, the numbers in the resulting matrices will be between -1 and 1.

Steps 4 to 6 show how a matmul breaks down into a set of dot products.

With normalized vectors (embeddings), the numbers in step 7 will always be between -1 and 1 and are called cosine similarities (as explained in @Netwolf’s post). Since 0.59 is the largest, it is the highest similarity among all Q&A pairs.

If your example was meant for understanding matmul (as your title said), then the problem is that, unlike the code in your screenshot, none of your a, b, c, d is a matrix. If you would like to, you might repeat my steps with your own matrices (i.e. combine your a and c as my Q, and combine your b and d as my A), along the lines of the sketch below.
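For instance, a minimal numpy sketch of those steps with made-up matrices (the numbers here are arbitrary, not the ones from the screenshot):

import numpy as np

# made-up embeddings: 2 questions and 2 answers, 3 dimensions each
Q = np.array([[1.0, 2.0, 2.0],      # step 1: raw numbers, not restricted to [-1, 1]
              [0.0, 3.0, 4.0]])
A = np.array([[2.0, 2.0, 1.0],
              [4.0, 0.0, 3.0]])

# step 2: normalize each row to unit length
Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
An = A / np.linalg.norm(A, axis=1, keepdims=True)

# the matmul is just all the row-by-row dot products at once,
# and with normalized rows each entry is a cosine similarity in [-1, 1]
sims = Qn @ An.T
# sims[i, j] == Qn[i] @ An[j]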

Cheers,
Raymond

This video explains the attention mechanism in an intuitive manner. Hope this helps.

Cheers,
Manoj

1 Like

Looking at this again, they are calling it “dot-product similarity”, and this would explain why they didn’t have my step 2 (normalization). In this case, the orientation still matters (negative dot-product result still means not-similar), but the magnitudes of the embeddings will also play a role.
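A tiny illustration of that with made-up vectors: orientation controls the sign, magnitude scales the value.

import numpy as np

a = np.array([1.0, 0.0])
b = np.array([-2.0, 0.0])   # opposite direction, larger magnitude
c = np.array([0.1, 0.0])    # same direction, small magnitude

a @ b                       # -2.0 -- negative: not similar (opposite orientation)
a @ c                       #  0.1 -- positive, but scaled down by the small magnitude of c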

1 Like

I understand that the principles of multiplication are the same, but that doesn’t really help to resolve the initial problem.

import numpy as np

a = np.array([0.2, 0.99])
b = np.array([0.9, 0.99])
a @ b   # 1.1601

a = np.array([0.2, 0.99])
b = np.array([0.2, 0.99])
a @ b   # 1.0201

So we will always get higher similarity for higher feature values in the embedding space. But obviously a and b in the second case are more similar than in the first.

Your example vectors do not have the same magnitude.

Normalize them first by dividing by the magnitude. The formula for the magnitude in this case is sqrt(x1^2 + x2^2).

A: (0.2, 0.99) has magnitude 1.01
A_norm: (0.198, 0.98)

B: (0.9, 0.99) has magnitude 1.34
B_norm: (0.67, 0.74)

A_norm * A_norm = 1
A_norm * B_norm = 0.86
B_norm * B_norm = 1
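The same calculation in numpy (reproducing the numbers above, rounded):

import numpy as np

A = np.array([0.2, 0.99])
B = np.array([0.9, 0.99])

A_norm = A / np.linalg.norm(A)   # approx [0.198, 0.980]
B_norm = B / np.linalg.norm(B)   # approx [0.673, 0.740]

A_norm @ A_norm                  # 1.0
A_norm @ B_norm                  # approx 0.86
B_norm @ B_norm                  # 1.0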

2 Likes

Hey everyone,

So just to clarify - when training as in the code shown, you don’t have to normalize (like in cosine similarity) - the contrastive loss applies Softmax and thus normalizes each row when the loss is computed.

If you were to normalize (as in cosine similarity), it’s effectively like multiplying each similarity score by a scalar related to the magnitudes of each pair of vectors before the softmax operation, so it may change the learning behavior.

In my experience, you usually don’t need to do this, as it may restrict the space of learned embeddings.
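A rough sketch of that setup, assuming a CLIP-style in-batch contrastive loss built from torch.nn.CrossEntropyLoss (the names and shapes here are made up, not taken from the course code):

import torch
import torch.nn as nn

# made-up question and answer embeddings for a batch of 4, dimension 8
q_emb = torch.randn(4, 8)
a_emb = torch.randn(4, 8)

# similarity matrix: row i holds the dot products of question i with every answer;
# the embeddings are NOT length-normalized here
sim = q_emb @ a_emb.T

# CrossEntropyLoss applies a softmax to each row internally,
# turning each row of scores into a probability distribution
labels = torch.arange(4)   # the matching answer for question i is answer i
loss = nn.CrossEntropyLoss()(sim, labels)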

1 Like

Another clarification - normalizing to achieve cosine similarity is a different normalization from softmax’s. They serve different purposes and have different results. While the first normalization is optional, the second one (softmax’s) is compulsory, and they are not replacements for each other.

However, I do agree that we don’t have to use cosine similarity to learn a working neural network, and yes, the normalization (for achieving cosine similarity) restricts the space of the learned embeddings and will change the results. Also, that normalization takes extra computational time; in other words, we can question whether that time translates into a performance improvement.

I agree with @ofermend’s comment that we don’t have to do that normalization (for achieving cosine similarity), since this has been how people have done things since at least word2vec.

Thanks, @ofermend, for bringing this up :wink:

Thank you very much! That was a very illustrative example! We’ve generally learned mean normalization, but this is Euclidean normalization, as I understand it. I don’t remember which course could have covered it in enough depth; maybe I just missed this topic.

But one question now. What if A: (0.2, 0.99) and B: (0.9, 0.99) are already normalized vectors? What are the initial vectors then?

Do I understand correctly that the normalization happens inside
torch.nn.CrossEntropyLoss?

  1. Yes, you don’t have to normalize inside the code. It works fine as it is.
  2. The softmax that’s part of the contrastive loss normalizes each row, and that’s why this works fine in practice.
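To make the two “normalizations” concrete, here is a small sketch (made-up numbers) showing that what happens inside torch.nn.CrossEntropyLoss is a softmax over each row of scores, which is not the same as dividing the embedding vectors by their magnitudes:

import torch
import torch.nn.functional as F

sim = torch.tensor([[2.0, 0.5],   # made-up rows of similarity scores
                    [0.1, 1.5]])
labels = torch.tensor([0, 1])

# cross-entropy = softmax over each row + negative log-likelihood of the correct column
probs = F.softmax(sim, dim=1)                                   # each row sums to 1
loss_manual = -torch.log(probs[torch.arange(2), labels]).mean()

loss_builtin = F.cross_entropy(sim, labels)                     # same value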

And one more small question. It’s nice that np.dot takes into account the length (= magnitude = norm) of the vectors. So I can understand that if the length of both vectors is 1, the result can’t be greater than 1 after multiplication. But if I understand correctly, the angle between the unit vectors should be important too. For this reason, as mentioned before, the Euclidean dot product looks like cosine similarity.

But if I understand correctly, we omit the angle in the lab entirely. Why does the similarity work well enough then?

I can imagine that np.dot of unit vectors a and b at different angles takes those angles into account and in fact gives a value smaller than 1, somewhere in between a and b. But I can’t fully understand why. And what is the geometrical role of np.dot in this case, then? Is it not a length?

How would you compute the angle \theta given two vectors? The point is that you don’t need to: the dot product is a very simple and cheap way to compute the cosine of the angle without actually computing the angle itself. For our purposes in deciding how similar two embedding vectors are, the cosine of the angle gives us that information. The geometric interpretation of the dot product is that it is the length of the projection of one of the vectors onto the other. Here’s a Math Exchange article found by googling “geometric interpretation of dot product”.

Of course if the angle is what you really want for some other reason, you could then use the arccosine to compute it, but that is not relevant or useful in the current context.
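For example (made-up vectors):

import numpy as np

a = np.array([1.0, 1.0])
b = np.array([1.0, 0.0])

cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))   # about 0.707
theta = np.arccos(cos_theta)                                    # about 0.785 rad, i.e. 45 degrees

# geometric view: (a @ b) / |b| is the length of the projection of a onto b
proj_len = (a @ b) / np.linalg.norm(b)                          # 1.0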

2 Likes