Why does a simple matmul of embedding vectors describe their similarity?

someone555777 · August 6, 2024, 8:22am

gent.spah · August 6, 2024, 8:27am

Because when you multiply 2 matrices, if the elements at the same position are similar, then the result gets amplified. On the other hand, if they are dissimilar, the result will diminish.

someone555777 · August 6, 2024, 8:57am

why?
let’s say
a = [0.2, 3]
b = [0.4, 3]

c = [0.2, 3]
d = [0.2, 3]

If I understand correctly
a @ b = [0.08, 9]
c @ d = [0.04, 9]

Obviously, c and d are similar. But a @ b has the bigger scores.

gent.spah · August 6, 2024, 9:24am

What kind of numbers are the embeddings?

someone555777 · August 6, 2024, 9:33am

Floating-Point Values

gent.spah · August 6, 2024, 9:48am

What magnitude?

someone555777 · August 6, 2024, 9:58am

It is hard to say, but when I look at the embedding data, it looks like they are normalized between -1 and 1.

gent.spah · August 6, 2024, 10:11am

The principle is what I wrote above; it's like cosine similarity and measuring the distance between representations of words in embeddings, as well as finding where the attention is placed in the sentence.

Have a look at this post here in the forum:

someone555777 · August 6, 2024, 10:32am

If I understand correctly, the QK matrix contains the attention weights between words, which will later be applied to the V matrix containing the actual values of the word embeddings. I am not 100% sure why the QK multiplication works either, but I can imagine that matmul describes some relations between all the words and some magic happens to those embeddings during training.

But why should we perceive a simple matmul as a similarity score? Can you explain it to me on a simple matrix? Why is it not a subtraction of the matrices, for example?
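
On the subtraction question, a short derivation may help: the squared distance you get from subtracting two vectors can be expanded in terms of their dot product (the multiply-and-sum that matmul performs for each pair of vectors). The unit-length condition in the second line is an assumption, not something established in the thread at this point, and \theta is the angle between the two vectors:

```latex
\|a - b\|^2 = (a - b) \cdot (a - b) = \|a\|^2 + \|b\|^2 - 2\, a \cdot b

\text{if } \|a\| = \|b\| = 1: \quad \|a - b\|^2 = 2 - 2\, a \cdot b = 2 - 2\cos(\theta)
```

So for normalized vectors, a larger dot product means a smaller distance, and the two measures rank pairs of vectors identically.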

gent.spah · August 6, 2024, 10:58am

Given that these word embeddings are vectors in a high-dimensional space, the dot product of these vectors will tell us how similar those 2 words are. If both vectors align, then the product will be large; otherwise it will be small, showing no similarity (this is also the idea behind cosine similarity). For example (I have taken this image from Wikipedia):

This is the principle involved here; you can also read these 2 links:

someone555777 · August 6, 2024, 12:49pm

If I understand correctly, this is not a dot product, it is a matmul here, isn’t it?
It looks like the output is a vector.

So, can you give me an example of the calculations based on my input?

paulinpaloalto · August 6, 2024, 3:14pm

a \cdot b = 0.2 * 0.4 + 3 * 3 = 9.08

c \cdot d = 0.2 * 0.2 + 3 * 3 = 9.04

That is the definition of the dot product of two vectors, right? You multiply the corresponding elements and then sum the results.

The other high level point in all this is “cosine similarity”, which derives from this mathematical relationship:

a \cdot b = ||a|| * ||b|| * cos(\theta)

where \theta is the angle subtended between the two vectors. In a lot of embedding cases, the vectors are normalized to have unit length so that it’s simple to compute the cosine similarity.
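
A minimal sketch of these two formulas applied to the vectors from the question, in plain Python so it runs without any libraries. The helper names `dot`, `norm` and `cosine` are made up for illustration. (In NumPy, `a @ b` on two 1-D arrays returns the same scalar as `dot(a, b)` here, while `a * b` gives the element-wise result.)

```python
import math

def dot(u, v):
    # Multiply corresponding elements, then sum the results
    return sum(x * y for x, y in zip(u, v))

def norm(u):
    # Euclidean length ||u||
    return math.sqrt(dot(u, u))

def cosine(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return dot(u, v) / (norm(u) * norm(v))

a, b = [0.2, 3], [0.4, 3]
c, d = [0.2, 3], [0.2, 3]

# Element-wise products only: roughly [0.08, 9], not yet a similarity score
elementwise_ab = [x * y for x, y in zip(a, b)]

print(dot(a, b), dot(c, d))        # roughly 9.08 and 9.04: raw dot products
print(cosine(a, b), cosine(c, d))  # roughly 0.998 and 1.0: identical vectors score highest
```

The raw dot product favours a and b only because b is slightly longer than d; dividing by the lengths removes that effect.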

someone555777 · August 6, 2024, 3:26pm

OK, I understand, so the @ is the dot product here and we apply it.

So, does the first example look more similar than the second because 9.08 > 9.04?

Why didn't we use cosine similarity in this architecture then? Is it possible at all?

paulinpaloalto · August 6, 2024, 3:36pm

The difference between 9.04 and 9.08 does not seem very significant.

I am not familiar with the material in this short course, so I can’t comment on why they don’t mention cosine similarity. But the dot product gives you the same answer, right? What you call it matters less. Do they discuss the point about normalizing the embedding vectors to have length one?

Nevermnd · August 6, 2024, 3:58pm

People have various opinions of his work, but at the start, I felt it helped me to ‘see’ it.

I pause the video a few sections before the action begins:

someone555777 · August 6, 2024, 5:46pm

It is OK that the difference is not significant, it was just an example. The main point is that 9.04 is less than 9.08, but as I understand it, we expect a higher dot product when the similarity is higher.

They mentioned cosine similarity, but didn’t use it for measuring the distance between the embeddings generated by the two encoders during training.

I don’t know. But as I understand it, in the course author’s opinion the dot product should serve the same function as cosine similarity.

I don’t remember that we put any emphasis on this, but yes, we used it in the lab, in the Encoder:

import torch

class Encoder(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, output_embed_dim):
        super().__init__()
        # Token ids -> embed_dim vectors
        self.embedding_layer = torch.nn.Embedding(vocab_size, embed_dim)
        self.encoder = torch.nn.TransformerEncoder(
            torch.nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True),
            num_layers=3,
            norm=torch.nn.LayerNorm([embed_dim]),
            enable_nested_tensor=False
        )
        # Map the encoder output to the final embedding size
        self.projection = torch.nn.Linear(embed_dim, output_embed_dim)

    def forward(self, tokenizer_output):
        x = self.embedding_layer(tokenizer_output['input_ids'])
        # The inverted attention mask marks the padding positions to ignore
        x = self.encoder(x, src_key_padding_mask=tokenizer_output['attention_mask'].logical_not())
        # Use the output at the first position as the sequence embedding
        cls_embed = x[:,0,:]
        return self.projection(cls_embed)

I still don't understand: is it so important?

someone555777 · August 6, 2024, 5:50pm

Oh, yes, I recall that we use additional learnable weights to get Q, K and V from the embeddings.
But I don't quite understand the conclusion you wanted to make.
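
For reference, a rough sketch of the computation being discussed: scaled dot-product attention with separate projection matrices for Q, K and V. It is written in plain Python for readability; the helper names and the tiny 2-dimensional inputs are made up, and the identity matrices only stand in for weights that would normally be learned during training.

```python
import math

def matmul(A, B):
    # Each output entry is the dot product of a row of A with a column of B
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    # Subtract the max for numerical stability, then normalize to sum to 1
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(X, W_q, W_k, W_v):
    # Project the embeddings into queries, keys and values
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    # Pairwise dot products between queries and keys, scaled by sqrt(d_k)
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    # Softmax turns each row of scores into attention weights
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted average of the value vectors
    return weights, matmul(weights, V)

# Three hypothetical token embeddings; token 2 overlaps with both others
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for learned weights

weights, output = attention(X, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and token 0 attends more to token 2 (which shares a component with it) than to token 1 (which is orthogonal to it).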

lighfe · August 6, 2024, 9:14pm

I had the same confusion as you. The main intuition is that if the vectors have the same direction, then they will be considered similar.
Example: [1, 0, -1] and [1, 0, 1] will have a low similarity score. [1, 0, -1] and [1, 0, -1] will have a high similarity score.

What got you confused is why a vector like [2, 0, -1] would have a higher similarity score than the identical vector. It would not, because the vectors need to have the same magnitude.

You can see in the img_to_encoding() function:
return embedding / np.linalg.norm(embedding, ord=2)
This normalizes the embeddings so that they all have the same magnitude.
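
A small sketch of this point using the vectors above, in plain Python. The `normalize` helper is a hypothetical stand-in for the division by `np.linalg.norm(embedding, ord=2)` in the line quoted from `img_to_encoding()`:

```python
import math

def dot(u, v):
    # Multiply corresponding elements, then sum
    return sum(x * y for x, y in zip(u, v))

def normalize(u):
    # Divide by the L2 norm so the result has length 1
    length = math.sqrt(dot(u, u))
    return [x / length for x in u]

u = [1, 0, -1]
v = [1, 0, 1]
w = [2, 0, -1]

# Raw dot products: the longer vector w beats the identical vector
print(dot(u, v), dot(u, u), dot(u, w))  # 0 2 3

# After normalization the identical vector wins
nu, nv, nw = normalize(u), normalize(v), normalize(w)
print(dot(nu, nv), dot(nu, nu), dot(nu, nw))  # roughly 0.0, 1.0, 0.949
```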

Feel free to correct me, if I got it wrong.

Nevermnd · August 6, 2024, 9:33pm

I’m not sure I had a further conclusion I wanted to say, and ha, I did not mean for you to go through all of transformers, just the vector part. Plus, of course, I was not speaking of anyone here.

I just meant I know some people find 3blue1brown’s explication style useful, others… not so much.

someone555777 · August 7, 2024, 8:40am

Your examples look like we are dealing only with integer values. So I can imagine that it will work if each feature position in the embedding space holds just one of the numbers -1, 0 or 1.

But we work with floating-point numbers, as I see from the output of the embeddings. And there we have exactly the problem with multiplication that I described earlier.
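
To check this with non-integer values, here is a sketch with random floating-point vectors (plain Python, fixed seed; the sizes 5 and 8 are arbitrary choices). Each vector is scaled to unit length first, which is the assumption the whole argument rests on, and then one matrix product of the embeddings with their own transpose yields every pairwise dot product at once:

```python
import math
import random

random.seed(0)

def normalize(u):
    # Scale a vector to unit L2 length
    length = math.sqrt(sum(x * x for x in u))
    return [x / length for x in u]

# 5 hypothetical embeddings with 8 float features each, scaled to unit length
E = [normalize([random.gauss(0, 1) for _ in range(8)]) for _ in range(5)]

# S = E @ E^T: entry S[i][j] is the dot product of embedding i and embedding j
S = [[sum(x * y for x, y in zip(ei, ej)) for ej in E] for ei in E]

for row in S:
    print([round(s, 2) for s in row])
```

The diagonal (each vector with itself) comes out as 1.0 and no other entry exceeds it, so for unit-length vectors nothing can score higher than an exact match, whatever the feature values are.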
