Why do we use the dot product as the method to find the relevance between word embeddings?

Hi @hiimbach,

thanks for your question.

The dot product helps the model figure out which words are most relevant to each other. In your example the model might focus on "don't" and "like" because of the given context, not because these words are semantically similar.
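To make that concrete, here is a minimal sketch of how a dot product turns into attention weights. The toy 4-dimensional vectors and the word labels in the comments are invented purely for illustration, not taken from any real model:

```python
import numpy as np

def dot_product_scores(query, keys):
    """Relevance of each key vector to the query via scaled dot product.

    A larger dot product means the key is more aligned with the query in
    the learned space, so it receives more attention after the softmax.
    """
    d_k = query.shape[-1]
    scores = keys @ query / np.sqrt(d_k)       # one raw score per key
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    return weights / weights.sum()

# Hypothetical projected vectors, just for the example:
query = np.array([1.0, 0.0, 1.0, 0.0])             # e.g. a projection of "like"
keys = np.array([[1.0, 0.2, 0.9, 0.1],             # e.g. a projection of "don't"
                 [0.1, 1.0, 0.0, 0.8],             # an unrelated word
                 [0.9, 0.1, 1.1, 0.0]])            # a related word

print(dot_product_scores(query, keys))  # highest weight on the most aligned key
```

The point is that the alignment is learned from context during training, so two words can get a high score for one head even if they are not synonyms.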

Here you can find a nice explanation of multi-head attention, touching upon the different focal points of the heads: Multi-headed Attention the mathematical meaning - #2 by arvyzukai

Word2Vec also relies on patterns learned from data, but it produces a single embedding space in which semantically similar or synonymous words end up with similar embedding vectors, i.e. close to each other (comparable to what one specific attention head can learn, as explained in the forum link above).
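As a small sketch of that idea, the snippet below measures closeness with cosine similarity. The 4-dimensional vectors and the word labels are made up for illustration; real Word2Vec embeddings have hundreds of dimensions:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors:
    close to 1.0 for similar directions, lower for unrelated ones."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings, purely for illustration:
good = np.array([0.9, 0.1, 0.8, 0.2])
great = np.array([0.8, 0.2, 0.9, 0.1])
car = np.array([0.1, 0.9, 0.0, 0.7])

print(cosine_similarity(good, great))  # high: near-synonyms sit close together
print(cosine_similarity(good, car))    # lower: unrelated words are farther apart
```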

So, fundamentally, multi-head attention is much more geared towards how human beings understand and process words in context to draw conclusions.

Hope that helps!

Best regards
Christian