This question has clearly been puzzling a few people, as evidenced by the number of closely related posts:
Despite reading all those posts and the two papers (Bengio et al., 2003; Mikolov et al., 2013), I'm still confused. This softmax expression from the lectures doesn't appear anywhere in the papers:
p(t|c) = \frac{e^{{\theta_t}^T e_c}}{\sum_{j=1}^{10,000} e^{{\theta_j}^T e_c}}
(Note that I'll stick to the notation from the lecture notes, and that e has different meanings depending on context: it is the exponential base in the softmax, but an embedding vector when subscripted, as in e_c.)
The lecture notes, and some comments in this forum, suggest that \theta_t is a vector that depends on the chosen target word t. Separately, \theta is commonly used in the literature to refer to the model parameters in a general sense, and other comments in this forum have suggested that it refers to the weights of … presumably some hidden layer (à la Bengio et al., 2003). The problem is that these two statements cannot both be true. Either \theta_t is a vector derived from the target word t through some unexplained mechanism, or it is a set of parameters; in the latter case it would usually be written as a weight matrix W.
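To make the ambiguity concrete, here is a minimal NumPy sketch of the lecture's softmax under the assumption that \theta is a 10,000 × n parameter matrix whose t-th row is \theta_t. That matrix, its shape, and all the names below are my assumptions, not anything stated in the lectures:

```python
import numpy as np

vocab_size, embed_dim = 10_000, 300
rng = np.random.default_rng(0)

E = rng.normal(size=(vocab_size, embed_dim))      # embedding matrix, rows are e_j
theta = rng.normal(size=(vocab_size, embed_dim))  # my assumption: theta_t = theta[t]

def p_t_given_c(t, c):
    e_c = E[c]                          # embedding of the context word c
    logits = theta @ e_c                # theta_j^T e_c for every j = 1..10,000
    logits -= logits.max()              # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[t]
```

I can't tell from the lectures whether this is the intended reading.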
Furthermore, the skip-gram implementation in Mikolov et al (2013), which the Word2Vec lectures are supposedly based on, explicitly drops the hidden layer. So there’s no \theta in the sense of network parameters other than the embedding matrix E.
I would ask, please, that one of those Clarification slides be added to the training material once this is all explained, especially given that the assignment depends on understanding this stuff.
The TensorFlow tutorial on word2vec offers an alternative (word2vec | Text | TensorFlow). Translating the equation there into the notation of our lecture notes we’d have:
p(t|c) = \frac{e^{{e_t}^T e_c}}{\sum_{j=1}^{10,000} e^{{e_j}^T e_c}}
So this at least agrees with some of the commentary, but it doesn't explain why Andrew would use \theta_t in place of e_t. Furthermore, it isn't consistent with the network model: it would require the embedding layer to be run against both the context and target words, whereas the diagrams in the lecture notes suggest that only the context word is embedded.
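For comparison, here is the same sketch rewritten for the TensorFlow-tutorial form, where the only parameters are the embedding matrix E itself and both the target and context words are looked up in it. Again, this is just my reading of the equation above, not a confirmed description of what the lectures intend:

```python
import numpy as np

vocab_size, embed_dim = 10_000, 300
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))   # the only parameter matrix: the embeddings

def p_t_given_c(t, c):
    scores = E @ E[c]                  # e_j^T e_c for every j: target AND context both use E
    scores -= scores.max()             # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[t]
```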
So, can someone please explain this clearly?