In the word2vec and negative sampling lectures, why do we need theta_t? Wouldn’t it make sense to just minimize the differences between the embeddings of the context and target words?
I understand that you’re looking for the similarity of theta_t and e_c by taking the inner product, but why do this instead of taking the inner product of e_t and e_c, since we want similar words to be close together?
Also, can someone clarify if theta_t is different for each word? I think that it is, but I don’t understand why we need essentially another whole matrix of parameters besides the embedding matrix.
When maximizing this probability, keeping the numerator lets gradient descent maximize the numerator's dot product, and keeping the denominator lets gradient descent minimize the other dot products in the denominator. We want both effects, so we keep both.
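For reference, here is the softmax probability being discussed, as I remember it from the lecture (writing V for the vocabulary size, \theta_t for the target-word parameter vector, and e_c for the context embedding), so it is clear which dot product sits where:
$$P(t \mid c) = \frac{\exp(\theta_t^\top e_c)}{\sum_{j=1}^{V} \exp(\theta_j^\top e_c)}$$
The numerator contains only \theta_t^\top e_c, while the denominator contains \theta_j^\top e_c for every word j in the vocabulary.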
Each word has one embedding, \theta_t, for when it acts as the target word, and another embedding, e_c, for when it acts as a context word. They have to be different so that the model has the freedom to make their dot product small. Think about why that freedom matters.
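Just to make the two sets of parameters concrete, here is a minimal numpy sketch (the names vocab_size, embed_dim, E, and Theta, and the random initialization, are mine for illustration, not the course's code):

```python
import numpy as np

vocab_size, embed_dim = 10000, 300

# Two separate parameter matrices, one row per word in the vocabulary:
E = np.random.randn(vocab_size, embed_dim) * 0.01      # context embeddings e_c
Theta = np.random.randn(vocab_size, embed_dim) * 0.01   # target parameters theta_t

def prob_target_given_context(target_idx, context_idx):
    """Softmax P(target | context): exp(theta_t . e_c) / sum_j exp(theta_j . e_c)."""
    e_c = E[context_idx]                 # look up the context word's embedding
    logits = Theta @ e_c                 # theta_j . e_c for every word j
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp[target_idx] / exp.sum()
```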
Btw, @kdj, you might cross-check your understanding of why there are two sets of embeddings by reading the word2vec paper; as I remember, they explain it somewhere in the first two pages in a pretty straightforward way.
I would usually share the link to the paper here, but I am in the middle of something at the moment; you can find it pretty easily by googling “word2vec paper”.
Thanks for your responses Raymond! The clarification that there are two weight matrices is helpful.
I read the word2vec paper, but I didn’t see anything about why they used two weight matrices. I don’t understand your comment that the embeddings for context vs. target have to “be different so that the model has the freedom to make their dot product small”. We would want the dot product to be large if the words are the same, right?
My apologies, I must have been remembering another paper. Let’s think about this question: in general, how probable do you think it is that the target word is also one of its own context words? Would you say it is large or small? In other words, how likely is the target word itself to appear again within the context window? Just look at our conversation here: in all these sentences, how often is a target word also one of its own context words?
Then another question to think about: if we have only one set of embeddings, what is the dot product in the case where the target word is the context word itself?
If you dotted the context word’s embedding with itself, the dot product would be large, which I guess is not the desired result. That’s a good point!
Just because one word is the most similar to another word does not mean that we should predict it as the context given the target.
An example I can think of is that different parts of speech of the same root word are not good predictions. If the target word is “strong”, then the context word is probably not “strongly”, which is one of the closest words when it comes to cosine similarity.
We need the theta parameters to account for the extra complexity that is involved in prediction (similarity alone does not result in good prediction).
I am glad that you have got it, and I agree with everything you said.
To put it another way: since “hello” is not likely to be a target word for the context “hello”, we need to be able to let the target embedding of “hello” and the context embedding of “hello” differ, so we need two embeddings for the word “hello”.
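One way to see why a single shared embedding cannot do this (this is my own back-of-the-envelope argument, assuming the word vectors have roughly comparable norms): with one shared vector v_w per word, the self-score is
$$v_w^\top v_w = \lVert v_w \rVert^2 \;\ge\; v_u^\top v_w \quad \text{for any word } u \text{ with } \lVert v_u \rVert \le \lVert v_w \rVert,$$
by the Cauchy–Schwarz inequality. So when predicting targets for the context “hello”, the word “hello” itself would always score at least as high as any word of comparable norm. With separate \theta_{hello} and e_{hello}, the dot product \theta_{hello}^\top e_{hello} is free to be learned small.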
In the end, the algorithm is learning two word embeddings for each word: the e vectors and the theta vectors. Is there any reason why the e vectors are chosen as the word embeddings and not the theta vectors? Obviously, from a computational point of view, you already get the E matrix mapping the one-hot words to their embedded e vectors for free, but it is pretty straightforward to build a similar matrix for the theta vectors. Is there a more fundamental reason for choosing the e vectors? Do you know if anyone in the literature has investigated the properties of the theta embeddings?
Thanks,
Leonardo
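Not an answer to the “which is more fundamental” question, but just to make the mechanical point concrete, a tiny sketch (the vocabulary, matrices, and names here are all illustrative): after training, either matrix can be read out as a word-vector table in exactly the same way.

```python
import numpy as np

# Illustrative only: a tiny vocabulary and random stand-ins for the trained matrices.
word_to_idx = {"strong": 0, "strongly": 1, "hello": 2}
embed_dim = 4
E = np.random.randn(len(word_to_idx), embed_dim)      # context (e) vectors
Theta = np.random.randn(len(word_to_idx), embed_dim)  # target (theta) vectors

def word_vector(word, table):
    """Read a word's vector out of whichever parameter matrix you choose."""
    return table[word_to_idx[word]]

v_e = word_vector("strong", E)          # the e vector, the usual choice
v_theta = word_vector("strong", Theta)  # the theta vector, just as easy to extract
```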