In the word2vec and negative sampling lectures, why do we need theta_t? Wouldn’t it make sense to just minimize the differences between the embeddings of the context and target words?
I understand that you’re looking for the similarity of theta_t and e_c by taking the inner product, but why do this instead of taking the inner product of e_t and e_c, since we want similar words to be close together?
Also, can someone clarify if theta_t is different for each word? I think that it is, but I don’t understand why we need essentially another whole matrix of parameters besides the embedding matrix.
When maximizing this probability, keeping the numerator lets gradient descent maximize the numerator's dot product, and keeping the denominator lets gradient descent minimize the other dot products in the denominator. We want both effects, so we keep both.
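For reference, here is the softmax probability being discussed, as I remember it from the lecture (writing V for the vocabulary size, \theta_t for the target-word parameter vector, and e_c for the context embedding), so it is clear which dot product sits where:
$$P(t \mid c) = \frac{\exp(\theta_t^\top e_c)}{\sum_{j=1}^{V} \exp(\theta_j^\top e_c)}$$
The numerator contains only \theta_t^\top e_c, while the denominator contains \theta_j^\top e_c for every word j in the vocabulary.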
Each word has one embedding, \theta_t, for when it acts as the target word, and another embedding, e_c, for when it acts as a context word. They have to be different so that the model has the freedom to make their dot product small. Think about why that freedom matters.
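Just to make the two sets of parameters concrete, here is a minimal numpy sketch (the names vocab_size, embed_dim, E, and Theta, and the random initialization, are mine for illustration, not the course's code):

```python
import numpy as np

vocab_size, embed_dim = 10000, 300

# Two separate parameter matrices, one row per word in the vocabulary:
E = np.random.randn(vocab_size, embed_dim) * 0.01      # context embeddings e_c
Theta = np.random.randn(vocab_size, embed_dim) * 0.01   # target parameters theta_t

def prob_target_given_context(target_idx, context_idx):
    """Softmax P(target | context): exp(theta_t . e_c) / sum_j exp(theta_j . e_c)."""
    e_c = E[context_idx]                 # look up the context word's embedding
    logits = Theta @ e_c                 # theta_j . e_c for every word j
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp[target_idx] / exp.sum()
```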
Btw, @kdj, you might cross-check your understanding of why there are two sets of embeddings by reading the word2vec paper; as I remember, they explain it somewhere in the first two pages in a pretty straightforward way.
I would usually share the link to the paper here, but I am in the middle of something at the moment; you can find it pretty easily by googling “word2vec paper”.
Thanks for your responses Raymond! The clarification that there are two weight matrices is helpful.
I read the word2vec paper, but I didn’t see anything about why they used two weight matrices. I don’t understand your comment that the embeddings for context vs. target have to “be different so that the model has the freedom to make their dot product small”. We would want the dot product to be large if the words are the same, right?
My apologies, I must have been remembering another paper. Let’s think about this question: in general, how probable do you think it is that the target word is also one of its own context words? Would you say it is large or small? In other words, how likely is the target word itself to appear again within the context window? Just look at our conversation here: in all these sentences, how often is a target word also one of its own context words?
Then another question to think about: if we have only one set of embeddings, what is the dot product in the case where the target word is the context word itself?
If you dotted the context word’s embedding with itself, the dot product would be large, which I guess is not the desired result. That’s a good point!
Just because one word is the most similar to another word does not mean that we should predict it as the context given the target.
An example I can think of is that different parts of speech of the same root word are not good predictions. If the target word is “strong”, then the context word is probably not “strongly”, which is one of the closest words when it comes to cosine similarity.
We need the theta parameters to account for the extra complexity that is involved in prediction (similarity alone does not result in good prediction).
I am glad that you have got it, and I agree with everything you said.
To put it another way: since “hello” is not likely to be a target word for the context “hello”, we need to be able to let the target embedding of “hello” and the context embedding of “hello” differ, so we need two embeddings for the word “hello”.
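One way to see why a single shared embedding cannot do this (this is my own back-of-the-envelope argument, assuming the word vectors have roughly comparable norms): with one shared vector v_w per word, the self-score is
$$v_w^\top v_w = \lVert v_w \rVert^2 \;\ge\; v_u^\top v_w \quad \text{for any word } u \text{ with } \lVert v_u \rVert \le \lVert v_w \rVert,$$
by the Cauchy–Schwarz inequality. So when predicting targets for the context “hello”, the word “hello” itself would always score at least as high as any word of comparable norm. With separate \theta_{hello} and e_{hello}, the dot product \theta_{hello}^\top e_{hello} is free to be learned small.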
In the end, the algorithm is learning two word embeddings for each word: the e vectors and the theta vectors. Is there any reason why the e vectors are chosen as the word embeddings and not the theta vectors? Obviously, from a computational point of view, you already get the E matrix mapping the one-hot words to their embedded e vectors for free, but it is pretty straightforward to build a similar matrix for the theta vectors. Is there a more fundamental reason for choosing the e vectors? Do you know if anyone in the literature has investigated the properties of the theta embeddings?
Thanks,
Leonardo
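Not an answer to the “which is more fundamental” question, but just to make the mechanical point concrete, a tiny sketch (the vocabulary, matrices, and names here are all illustrative): after training, either matrix can be read out as a word-vector table in exactly the same way.

```python
import numpy as np

# Illustrative only: a tiny vocabulary and random stand-ins for the trained matrices.
word_to_idx = {"strong": 0, "strongly": 1, "hello": 2}
embed_dim = 4
E = np.random.randn(len(word_to_idx), embed_dim)      # context (e) vectors
Theta = np.random.randn(len(word_to_idx), embed_dim)  # target (theta) vectors

def word_vector(word, table):
    """Read a word's vector out of whichever parameter matrix you choose."""
    return table[word_to_idx[word]]

v_e = word_vector("strong", E)          # the e vector, the usual choice
v_theta = word_vector("strong", Theta)  # the theta vector, just as easy to extract
```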