Because when you multiply two matrices, if the elements at the same position are similar, the result gets amplified. On the other hand, if they are dissimilar, the result diminishes.

why?

let’s say

a = [0.2, 3]

b = [0.4, 3]

c = [0.2, 3]

d = [0.2, 3]

If I understand correctly

a @ b = [0.08, 9]

c @ d = [0.04, 9]

Obviously, c and d are similar (they are identical). But a @ b has the bigger scores.

What kind of numbers are the embeddings?

Floating-Point Values

What magnitude?

It is hard to say, but when I look at the embedding data, it looks like they are normalized between -1 and 1.

The principle is what I wrote above: it's like cosine similarity, measuring the distance between representations of words in embedding space, as well as finding where the attention is placed in the sentence.

Have a look at this post here in the forum:

If I understand correctly, the QK matrix contains attention weights between words, which will later be applied to the V matrix containing the actual word embedding values. I am not 100% sure why the QK multiplication works either, but I can imagine that the matmul describes some relations between all the words, and some magic happens with those embeddings during training.
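What I have in mind looks roughly like this: a minimal NumPy sketch of the standard scaled dot-product attention (the Q/K/V numbers are made up, not taken from the lab):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# 3 words, 4-dimensional Q/K/V vectors (random, just for illustration)
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

# Q @ K.T: each entry is a dot product between one query and one key,
# i.e. a similarity score between a pair of words
scores = Q @ K.T / np.sqrt(Q.shape[-1])

weights = softmax(scores, axis=-1)  # each row sums to 1
out = weights @ V                   # weighted mix of the value vectors
```

So the matmul Q @ K.T is really just all pairwise dot products at once, and those scores decide how much of each value vector flows into the output.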

But why should we perceive a simple matmul as a similarity score? Can you explain it to me with a simple matrix? Why is it not, for example, a subtraction of matrices?

Given that these word embeddings are vectors in a high-dimensional space, the dot product of two such vectors tells us how similar those two words are. If both vectors align, the product will be large; otherwise it will be small, showing no similarity (this is also the cosine similarity). For example (I have taken this image from Wikipedia):

That is the principle involved here. You can also read these two links:

If I understand correctly, there is no dot product here, there is a matmul, isn't there?

It looks like a vector on the output.

So, can you give me an example of the calculations based on my input?

a \cdot b = 0.2 * 0.4 + 3 * 3 = 9.08

c \cdot d = 0.2 * 0.2 + 3 * 3 = 9.04

That is the definition of the dot product of two vectors, right? You multiply the corresponding elements and then sum the results.
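You can check this in NumPy with the vectors from earlier in the thread. Note that for 1-D arrays, `@` is exactly this dot product (a single number), while `*` is the element-wise product (still a vector):

```python
import numpy as np

a = np.array([0.2, 3.0])
b = np.array([0.4, 3.0])
c = np.array([0.2, 3.0])
d = np.array([0.2, 3.0])

# element-wise product: a vector, [0.2*0.4, 3*3]
elementwise = a * b   # ≈ [0.08, 9.0]

# '@' on 1-D arrays is the dot product: multiply element-wise, then sum
print(a @ b)  # ≈ 9.08
print(c @ d)  # ≈ 9.04
```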

The other high level point in all this is “cosine similarity”, which derives from this mathematical relationship:

a \cdot b = ||a|| * ||b|| * cos(\theta)

where \theta is the angle subtended between the two vectors. In a lot of embedding cases, the vectors are normalized to have unit length so that it’s simple to compute the cosine similarity.
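A quick numeric check of that identity, again with the vectors from earlier (the variable names are my own): once the vectors are normalized to unit length, the plain dot product *is* the cosine similarity.

```python
import numpy as np

a = np.array([0.2, 3.0])
b = np.array([0.4, 3.0])

# cos(theta) = (a . b) / (||a|| * ||b||)
cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# normalize to unit length first, then a plain dot product gives cos(theta)
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)
print(a_hat @ b_hat)  # same value as cos_theta, close to 1 here
```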

Ok, I understand, so the @ is the dot product here, and we apply it.

So, does the first example look more similar than the second because 9.08 > 9.04?

Why don't we use cosine similarity in this architecture then? Is it possible at all?

The difference between 9.04 and 9.08 does not seem very significant.

I am not familiar with the material in this short course, so I can’t comment on why they don’t mention cosine similarity. But the dot product gives you the same answer, right? What you call it matters less. Do they discuss the point about normalizing the embedding vectors to have length one?

People have various opinions of his work, but at the start, I felt it helped me to 'see' it.

I pause the video a few sections before the action begins:

It is ok that it's not significant; it was just an example. The main point is that 9.04 is less than 9.08, but we expect a higher dot product when the similarity is greater, as I understand it.

They mentioned cosine similarity, but didn't use it for measuring the distance between the embeddings generated by the two encoders during training.

I don't know. But in the course author's opinion, the dot product should serve the same function as cosine similarity, as I understand it.

I don't remember that we put any emphasis on this, but yes, we used it in the lab, in the Encoder:

```
import torch

class Encoder(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, output_embed_dim):
        super().__init__()
        self.embedding_layer = torch.nn.Embedding(vocab_size, embed_dim)
        self.encoder = torch.nn.TransformerEncoder(
            torch.nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True),
            num_layers=3,
            norm=torch.nn.LayerNorm([embed_dim]),
            enable_nested_tensor=False
        )
        self.projection = torch.nn.Linear(embed_dim, output_embed_dim)

    def forward(self, tokenizer_output):
        x = self.embedding_layer(tokenizer_output['input_ids'])
        # mask out padding positions (attention_mask: 1 = real token, 0 = padding)
        x = self.encoder(x, src_key_padding_mask=tokenizer_output['attention_mask'].logical_not())
        cls_embed = x[:, 0, :]  # first ([CLS]) position as the sentence embedding
        return self.projection(cls_embed)
```

I still don't understand: is it so important?

Oh, yes, I recall that we use additional learnable weights to get Q, K and V from the embeddings.

But I don't quite understand the conclusion you wanted to draw.

I had the same confusion as you. The main intuition is that if the vectors have the same direction, then they will be considered similar.

Example: [1, 0, -1] and [1, 0, 1] will have a low similarity score. [1, 0, -1] and [1, 0, -1] will have a high similarity score.

What got you confused is why a vector like [2, 0, -1] wouldn't now have a higher similarity score than the identical vector. It doesn't, because the vectors are made to have the same **magnitude**.

You can see in the img_to_encoding() function:

*return embedding / np.linalg.norm(embedding, ord=2)*

This normalizes the embeddings so they all have the same magnitude.
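A small NumPy check of that point (the vectors are my own made-up example): after normalization, the identical vector wins, and the longer vector [2, 0, -1] no longer gets an unfair boost from its magnitude.

```python
import numpy as np

def normalize(v):
    # same idea as the img_to_encoding() line: divide by the L2 norm
    return v / np.linalg.norm(v, ord=2)

v = normalize(np.array([1.0, 0.0, -1.0]))
u = normalize(np.array([2.0, 0.0, -1.0]))  # longer raw vector, similar direction

print(v @ v)  # ≈ 1.0: identical direction scores highest
print(u @ v)  # ≈ 0.95: similar but not identical, so a lower score
```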

Feel free to correct me, if I got it wrong.

I'm not sure I had a further conclusion to draw, and ha, I did not mean for you to go through *all* of transformers, just the vector part. Plus, of course, I was not speaking of anyone here.

I just meant I know some people find 3blue1brown's explanation style useful, others… not so much.

Your examples look very much like we are dealing with binary integer classification only. So I can imagine that it would work if we had just one of these numbers, -1, 0 or 1, in the feature positions of the embedding space.

But we work with floating-point numbers, as I see from the embedding output. And then we get exactly the problems with multiplication that I described earlier.