Word2Vec theta matrix

Hello, do you know what the parameter Θ refers to? Is it the parameters of the softmax function?

I actually have some doubts about my understanding of what Θ_t means, and of the sum over the whole vocabulary (Θ_j).

I want to know how to get Θ_t or Θ_j: is it the product of the embedding vector of the corresponding word passed through the softmax parameters?

Here is the link to the course:


Hello @jourdelune863,

You have a vocab of 10,000 words. When you compute p(t | c), you have a context word and a target word.

You take the context embedding e_c out, you take the target word vector \theta_t out, and you operate on them. \theta_j's index j iterates from 1 to 10,000, because you have 10,000 words. The softmax equation is saying that the probability p(t|c) is the exponentiated context-target dot product divided by the sum of the exponentiated dot products over all 10,000 possible target words.

The vectors are trainable parameters that are tuned by gradient descent. You initialize them randomly and let them be trained.
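If it helps, here is a minimal numpy sketch of that computation. The vocab size of 10,000 is from the lecture; the embedding size and the random initialization are just stand-ins for the trained parameters:

```python
import numpy as np

vocab_size, emb_dim = 10_000, 300   # 300 is an arbitrary embedding size

# Trainable parameters (random stand-ins here):
#   E[j]     -> context embedding e_j of word j
#   theta[j] -> target vector theta_j of word j
E = np.random.randn(vocab_size, emb_dim) * 0.01
theta = np.random.randn(vocab_size, emb_dim) * 0.01

def p_target_given_context(t, c):
    """p(t|c) = exp(theta_t . e_c) / sum_j exp(theta_j . e_c)"""
    e_c = E[c]                    # context embedding, shape (emb_dim,)
    logits = theta @ e_c          # theta_j . e_c for every j, shape (vocab_size,)
    logits -= logits.max()        # for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[t]

# Probability that word 42 is the target given context word 7
print(p_target_given_context(t=42, c=7))
```

During training, gradient descent nudges `theta` and `E` so that this probability becomes high for context and target pairs that actually occur in the corpus.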

Raymond


This question has clearly been puzzling a few people, as evidenced by the number of closely related posts:

Despite reading all those posts, and the two papers (Bengio et al., 2003; Mikolov et al., 2013), I’m still confused. This softmax expression from the lectures doesn’t appear anywhere in either paper:
p(t|c) = \frac{e^{{\theta_t}^T e_c}}{\sum_{j=1}^{10,000} e^{{\theta_j}^T e_c}}

(Note that I’ll stick to the notation from the lecture notes, and that e has different meanings depending on context.)

The lecture notes, and some comments in this forum, suggest that \theta_t is a vector that depends on the chosen target word t. Separately, \theta is commonly used in the literature to refer to the model parameters in a general sense, and other comments in this forum have suggested that it refers to the weights of … presumably some hidden layer (à la Bengio et al., 2003). The problem is that these two statements cannot both be true. Either \theta_t is a vector derived from the target word t through some unexplained mechanism, or it is a set of parameters. In the latter case, it is usually represented as a weight matrix W.

Furthermore, the skip-gram implementation in Mikolov et al (2013), which the Word2Vec lectures are supposedly based on, explicitly drops the hidden layer. So there’s no \theta in the sense of network parameters other than the embedding matrix E.

I would ask, please, that one of those Clarification slides be added to the training material once this is all explained, especially given that the assignment depends on understanding this.

The TensorFlow tutorial on word2vec offers an alternative (word2vec | Text | TensorFlow). Translating the equation there into the notation of our lecture notes, we’d have:

p(t|c) = \frac{e^{{e_t}^T e_c}}{\sum_{j=1}^{10,000} e^{{e_j}^T e_c}}

So this at least agrees with some of the commentary. But it doesn’t explain why Andrew would use \theta_t in place of e_t. Furthermore, it isn’t consistent with the network model. This would require that the embedding layer is run against both the context and target words, but the diagrams in the lecture notes suggest that only the context word is.
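To make the contrast concrete, here is how I read the tutorial’s version: the same embedding matrix is looked up for both words (a rough numpy sketch; the names are mine, not from the lecture):

```python
import numpy as np

vocab_size, emb_dim = 10_000, 300
E = np.random.randn(vocab_size, emb_dim) * 0.01   # one shared embedding matrix

def p_tutorial_style(t, c):
    """Tutorial-style reading: p(t|c) = exp(e_t . e_c) / sum_j exp(e_j . e_c)."""
    logits = E @ E[c]             # e_j . e_c for all j -- E is used for BOTH words
    logits -= logits.max()        # numerical stability
    return np.exp(logits)[t] / np.exp(logits).sum()
```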

So, can someone please explain this clearly.


Hello @malcolm.lett,

Is your question about why Andrew denoted the target embedding with \theta but the context embedding with a different symbol e?

If so, I think the reason is that these embeddings come from two different embedding layers, as in your TensorFlow link ↓
[screenshot from the TensorFlow word2vec tutorial]

Two embedding layers → target_embedding and context_embedding.

In this case, if, incidentally, the target word and the (true) context word happen to be the same word, say “bye”, then \theta_t and e_c (or \theta_{\text{bye}} and e_{\text{bye}}) will still be two different vectors, because they come from two different embedding layers.
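A minimal Keras sketch of that point, loosely following the tutorial’s model (the word index and the embedding size are made up; the tutorial’s exact code differs):

```python
import tensorflow as tf

vocab_size, embedding_dim = 10_000, 128

# Two separate embedding layers, holding two separate weight matrices
target_embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                             name="w2v_embedding")        # produces theta_t
context_embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)  # produces e_c

bye = tf.constant([42])                 # pretend 42 is the index of the word "bye"

theta_bye = target_embedding(bye)       # theta_{bye}
e_bye = context_embedding(bye)          # e_{bye}

# Same word, two different vectors, because the weights are separate
print(bool(tf.reduce_all(tf.equal(theta_bye, e_bye))))   # almost surely False

# The logit that goes into the softmax is the dot product theta_t . e_c
logit = tf.reduce_sum(theta_bye * e_bye, axis=-1)
```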

I think this would justify why we should use two different symbols for two different embeddings. Agree?

I hope we will agree on the above point first, and if anything is still unclear, we can go through it one by one.

Raymond

Oh wow. I didn’t see anything in the lectures that would suggest a separate embedding layer for the target. And even if that is suggested by either of the two papers (which I haven’t read in detail), it’s certainly not obvious.

If there truly are separate embeddings, then sure, a separate symbol makes sense.

I come back to my earlier request: please add some clarification to the lecture notes.


For any clarification about the lectures, it will be @Mubsi’s and the team’s decision :wink: (Thanks, Mubsi), but here, @malcolm.lett, I think we can focus on the separate embedding.

First, if there were no separate embeddings, then one thing is certain to happen: if the target and the context word are the same word and their embeddings come from the same embedding layer, the dot product \theta_t \cdot e_c is high (at the very least, it can’t be negative), because it will be |\theta_t|^2.

However, most (but not all) of the time, a target word is not its own context word, which means we should allow the training algorithm to make such a dot product small (ideally negative). This is enabled by the two separate embeddings (otherwise, the algorithm might have no choice but to force \theta_t close to the zero vector, which is not favourable, because the minimum of |\theta_t|^2 is 0).
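A tiny numerical illustration of that point, with random vectors just for intuition:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Shared embedding: when t == c, theta_t and e_c are the SAME vector
v = rng.normal(size=d)
print(v @ v)             # = |v|^2 >= 0, so training cannot make this pair's score negative

# Separate embeddings: theta_t and e_c differ even when t == c
theta_t = rng.normal(size=d)
e_c = rng.normal(size=d)
print(theta_t @ e_c)     # can be negative, so training is free to push it down
```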

On the other hand, in the GloVe lecture, Andrew re-used the term \theta_i \cdot e_j, and the GloVe paper mentions this point as well:
[excerpt from the GloVe paper]
Maybe the GloVe paper would be a better reference if needed.

Cheers,
Raymond

Furthermore, for those who come to this thread later, we should also note that, in the TensorFlow tutorial, only the target word embeddings are named “w2v_embedding”:

[code snippet from the TensorFlow word2vec tutorial]

and, similarly, only the target word embeddings are extracted, saved, and analyzed.

In other words, yes: we only have one set of w2v_embedding vectors, but to train them, we do need another set of (context) embeddings during training. The context embeddings act like anchors.
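For example, after training, something along these lines pulls out just that one matrix (a sketch only; `word2vec` stands for the trained model from the tutorial):

```python
# Assumes `word2vec` is the trained Keras model whose target embedding layer
# is named "w2v_embedding", as noted above.
weights = word2vec.get_layer("w2v_embedding").get_weights()[0]
print(weights.shape)   # (vocab_size, embedding_dim): one row per word;
                       # these rows are the word vectors that get saved and analyzed
```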

Cheers,
Raymond