Why do we use dot product as the method to find the relevance between word embeddings?

Hi, I am relearning about NLP. I have read about the attention mechanism and am now learning the word2vec again to understand it better.

I see that both word2vec and attention use the dot product of embeddings, which geometrically represents the similarity of those embeddings. In word2vec, the dot product goes into the softmax function to calculate the probability of the center word given the context words. In attention, the dot product is taken between Q and K to calculate the attention scores, i.e. how much attention each token should pay to the others.
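To make the attention side concrete, here is a minimal sketch of scaled dot-product attention scores in NumPy. The Q and K matrices are random placeholders, not values from any real model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy setup: 3 tokens, embedding dimension 4 (values are arbitrary).
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))

# Entry [i, j] of the score matrix is the dot product of query i with key j,
# i.e. how strongly token i attends to token j, normalized by softmax.
scores = softmax(Q @ K.T / np.sqrt(Q.shape[1]))

print(scores)               # each row sums to 1
```

Each row is a probability distribution over the tokens, which is exactly the "how much attention" interpretation of the dot product.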

My question is why the similarity of these vectors can be used to define the relevance between words. For example, in the sentence "I don't like you", word2vec will optimize so that the dot product of "don't" and "like" becomes bigger, which means these vectors will become more similar; but in fact, the meanings of these two words are not similar. Or in the attention block, "don't" seems to pay more attention to "like", but how can this relevance be expressed by the similarity of these words?

I have asked many people I know, but no one has given a convincing answer, so I would be very grateful for your point of view. Thanks so much in advance!

Hi @hiimbach,

thanks for your question.

The dot product helps the model figure out what is most relevant. In your example, the model might focus on "don't" and "like" because of the given context, not because these words are semantically similar.
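One way to see why this is not a contradiction: the attention score is not taken between the raw embeddings but between learned query and key projections of them. A toy sketch (all vectors and matrices below are made up for illustration; in a real model the projections are learned):

```python
import numpy as np

# Hypothetical 2-d embeddings for "don't" and "like", chosen to be
# deliberately dissimilar: their raw dot product is negative.
e_dont = np.array([1.0, -1.0])
e_like = np.array([-1.0, 1.0])
print(e_dont @ e_like)   # negative: the raw embeddings are NOT similar

# Learned projection matrices can still map them so that the *query*
# of "don't" lines up with the *key* of "like".
W_q = np.array([[1.0, 0.0], [0.0, 1.0]])
W_k = np.array([[-1.0, 0.0], [0.0, -1.0]])

q_dont = e_dont @ W_q
k_like = e_like @ W_k
print(q_dont @ k_like)   # positive: a high attention score anyway
```

So "don't" attending to "like" says the query/key projections have learned that negators are relevant to the verbs they modify; it does not say the two word embeddings themselves are similar.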

Here you can find a nice explanation of multi head attention, touching upon different focal points of the heads: Multi-headed Attention the mathematical meaning - #2 by arvyzukai

Word2Vec also relies on patterns learned from data, but here the result is an embedding space in which similar words have similar embedding vectors (like one specific head, as explained in the forum link above). Similar embedding vectors are close to each other; put differently, synonymous or semantically similar words can be modelled effectively.
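The usual way to check this closeness is cosine similarity. A minimal sketch with made-up 3-d vectors (real Word2Vec embeddings are learned from co-occurrence statistics and have hundreds of dimensions):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: dot product of the two vectors after length-normalization.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy embeddings, invented purely for illustration.
emb = {
    "good":  np.array([0.90, 0.80, 0.10]),
    "great": np.array([0.85, 0.75, 0.20]),
    "car":   np.array([0.10, 0.20, 0.95]),
}

print(cosine(emb["good"], emb["great"]))  # high: near-synonyms sit close together
print(cosine(emb["good"], emb["car"]))    # lower: unrelated words are farther apart
```

Note that in Word2Vec a large dot product between a center word and a context word means "these words co-occur", not "these words are synonyms"; the synonym-like geometry emerges because words that appear in similar contexts end up with similar vectors.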

So, fundamentally, multi-head attention is much more geared towards how human beings understand and process words in context to draw conclusions.

Hope that helps!

Best regards