Why do we use dot product as the method to find the relevance between word embeddings?

Hi, I am relearning about NLP. I have read about the attention mechanism and am now learning the word2vec again to understand it better.

I see that both word2vec and attention use the dot product of embeddings, which geometrically represents the similarity of those embeddings. In word2vec, the dot product goes into the softmax function to calculate the probability of the center word given the context words. In attention, the dot product is taken between Q and K to calculate the attention scores, i.e. how much attention each token should pay to the others.
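To make the attention side concrete, here is a minimal sketch of scaled dot-product attention scores in NumPy. The Q and K matrices are random placeholders, not values from any real model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy setup: 3 tokens, embedding dimension 4 (values are arbitrary).
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))

# Entry [i, j] of the score matrix is the dot product of query i with key j,
# i.e. how strongly token i attends to token j, normalized by softmax.
scores = softmax(Q @ K.T / np.sqrt(Q.shape[1]))

print(scores)               # each row sums to 1
```

Each row is a probability distribution over the tokens, which is exactly the "how much attention" interpretation of the dot product.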

My question is why the similarity of these vectors can be used to define the relevance between words. For example, in the sentence "I don't like you", word2vec will optimize so that the dot product of "don't" and "like" becomes bigger, which means these vectors will become more similar; but in fact, the meanings of these two words are not similar. Or in the attention block, "don't" seems to pay more attention to "like", but how can this relevance be expressed by the similarity of these words?

I have asked many people I know, but no one has given a convincing answer, so I would be very grateful for your point of view. Thanks so much in advance!

Hi @hiimbach,

thanks for your question.

The dot product helps the model figure out what is most relevant. In your example, the model might focus on "don't" and "like" because of the given context, not because these words are semantically similar.
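One way to see why this is not a contradiction: the attention score is not taken between the raw embeddings but between learned query and key projections of them. A toy sketch (all vectors and matrices below are made up for illustration; in a real model the projections are learned):

```python
import numpy as np

# Hypothetical 2-d embeddings for "don't" and "like", chosen to be
# deliberately dissimilar: their raw dot product is negative.
e_dont = np.array([1.0, -1.0])
e_like = np.array([-1.0, 1.0])
print(e_dont @ e_like)   # negative: the raw embeddings are NOT similar

# Learned projection matrices can still map them so that the *query*
# of "don't" lines up with the *key* of "like".
W_q = np.array([[1.0, 0.0], [0.0, 1.0]])
W_k = np.array([[-1.0, 0.0], [0.0, -1.0]])

q_dont = e_dont @ W_q
k_like = e_like @ W_k
print(q_dont @ k_like)   # positive: a high attention score anyway
```

So "don't" attending to "like" says the query/key projections have learned that negators are relevant to the verbs they modify; it does not say the two word embeddings themselves are similar.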

Here you can find a nice explanation of multi head attention, touching upon different focal points of the heads: Multi-headed Attention the mathematical meaning - #2 by arvyzukai

Word2Vec also relies on patterns learned from data, but here the result is an embedding space in which similar words have similar embedding vectors (like one specific head, as explained in the forum link above). Similar embedding vectors are close to each other; put differently, synonymous or semantically similar words can be modelled effectively.
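The usual way to check this closeness is cosine similarity. A minimal sketch with made-up 3-d vectors (real Word2Vec embeddings are learned from co-occurrence statistics and have hundreds of dimensions):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: dot product of the two vectors after length-normalization.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy embeddings, invented purely for illustration.
emb = {
    "good":  np.array([0.90, 0.80, 0.10]),
    "great": np.array([0.85, 0.75, 0.20]),
    "car":   np.array([0.10, 0.20, 0.95]),
}

print(cosine(emb["good"], emb["great"]))  # high: near-synonyms sit close together
print(cosine(emb["good"], emb["car"]))    # lower: unrelated words are farther apart
```

Note that in Word2Vec a large dot product between a center word and a context word means "these words co-occur", not "these words are synonyms"; the synonym-like geometry emerges because words that appear in similar contexts end up with similar vectors.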

So, fundamentally, multi-head attention is much more geared towards how human beings understand and process words in context to draw conclusions.

Hope that helps!

Best regards