Why do we use dot product as the method to find the relevance between word embeddings?

Hi, I am relearning about NLP. I have read about the attention mechanism and am now learning the word2vec again to understand it better.

I see that both word2vec and attention use the dot product of embeddings, which geometrically represents the similarity of those embeddings. In word2vec, the dot product goes into the softmax function to calculate the probability of the center word given the context words. In attention, the dot product is taken between Q and K to calculate the attention scores, i.e. how much attention each token should pay to the others.
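To make the attention side concrete, here is a minimal sketch of scaled dot-product attention scores in NumPy. The Q and K matrices are random placeholders, not values from any real model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy setup: 3 tokens, embedding dimension 4 (values are arbitrary).
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))

# Entry [i, j] of the score matrix is the dot product of query i with key j,
# i.e. how strongly token i attends to token j, normalized by softmax.
scores = softmax(Q @ K.T / np.sqrt(Q.shape[1]))

print(scores)               # each row sums to 1
```

Each row is a probability distribution over the tokens, which is exactly the "how much attention" interpretation of the dot product.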

My question is why the similarity of these vectors can be used to define the relevance between words. For example, in the sentence "I don't like you", word2vec will optimize so that the dot product of "don't" and "like" becomes bigger, which means these vectors will become more similar; but in fact, the meanings of these two words are not similar. Or in the attention block, "don't" seems to pay more attention to "like", but how can this relevance be expressed by the similarity of these words?

I have asked many people I know, but no one has given a convincing answer, so I would be very grateful for your point of view. Thanks so much in advance!

Hi @hiimbach,

thanks for your question.

The dot product helps the model figure out what is most relevant. In your example, the model might focus on "don't" and "like" because of the given context, not because these words are semantically similar.
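One way to see why this is not a contradiction: the attention score is not taken between the raw embeddings but between learned query and key projections of them. A toy sketch (all vectors and matrices below are made up for illustration; in a real model the projections are learned):

```python
import numpy as np

# Hypothetical 2-d embeddings for "don't" and "like", chosen to be
# deliberately dissimilar: their raw dot product is negative.
e_dont = np.array([1.0, -1.0])
e_like = np.array([-1.0, 1.0])
print(e_dont @ e_like)   # negative: the raw embeddings are NOT similar

# Learned projection matrices can still map them so that the *query*
# of "don't" lines up with the *key* of "like".
W_q = np.array([[1.0, 0.0], [0.0, 1.0]])
W_k = np.array([[-1.0, 0.0], [0.0, -1.0]])

q_dont = e_dont @ W_q
k_like = e_like @ W_k
print(q_dont @ k_like)   # positive: a high attention score anyway
```

So "don't" attending to "like" says the query/key projections have learned that negators are relevant to the verbs they modify; it does not say the two word embeddings themselves are similar.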

Here you can find a nice explanation of multi head attention, touching upon different focal points of the heads: Multi-headed Attention the mathematical meaning - #2 by arvyzukai

Word2Vec also relies on patterns learned from data, but here the result is an embedding space in which similar words have similar embedding vectors (like one specific head, as explained in the forum link above). Similar embedding vectors are close to each other; put differently, synonymous or semantically similar words can be modelled effectively.
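The usual way to check this closeness is cosine similarity. A minimal sketch with made-up 3-d vectors (real Word2Vec embeddings are learned from co-occurrence statistics and have hundreds of dimensions):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: dot product of the two vectors after length-normalization.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy embeddings, invented purely for illustration.
emb = {
    "good":  np.array([0.90, 0.80, 0.10]),
    "great": np.array([0.85, 0.75, 0.20]),
    "car":   np.array([0.10, 0.20, 0.95]),
}

print(cosine(emb["good"], emb["great"]))  # high: near-synonyms sit close together
print(cosine(emb["good"], emb["car"]))    # lower: unrelated words are farther apart
```

Note that in Word2Vec a large dot product between a center word and a context word means "these words co-occur", not "these words are synonyms"; the synonym-like geometry emerges because words that appear in similar contexts end up with similar vectors.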

So, fundamentally, multi-head attention is much more geared towards how human beings understand and process words in context to draw conclusions.

Hope that helps!

Best regards