C5W2: Word2Vec. Softmax (difference between e_c and theta_t)

Hello everyone,
I have a question about the softmax formulation parameters (e_c, θ_t).
Here’s how I currently understand it:

  1. e_c is the context word embedding vector and is computed as: o_c * E,
    where o_c is a one-hot vector for the context word (see the sketch after this list).

  2. The E matrix holds the learnable parameters of the model and will be updated on each
    batch/mini-batch; its shape is (number_of_features, vocab_size).

  3. Now, my question is about θ_t:
    Is θ_t the output embedding vector corresponding to word t, just like e_c is the
    embedding vector for the context word? That is, θ_t = o_t * E?
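
For concreteness, here is a minimal NumPy sketch of the lookup from items 1 and 2 (the vocabulary size, feature count, and word index are made up for illustration; note that with E of shape (n_features, vocab_size), the product has to be written E @ o_c):

```python
import numpy as np

vocab_size, n_features = 10, 4               # toy sizes for illustration
E = np.random.randn(n_features, vocab_size)  # learnable embedding matrix

c = 3                                        # hypothetical index of the context word
o_c = np.zeros(vocab_size)
o_c[c] = 1.0                                 # one-hot vector for the context word

e_c = E @ o_c                                # selects column c of E
assert np.allclose(e_c, E[:, c])             # the product is just a column lookup
```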

To clarify further with an example:
If the context word and the target word are the same, e.g., context = "Noah" and target = "Noah", does that mean e_c == θ_t?

Thanks in advance!

Hello, @Yasmeen_Asaad_Azazi,

Target word embeddings and context word embeddings are two different sets of embeddings, and both are initialized randomly, so e_c is not equal to θ_t, even when the context and target words are the same.

To take an embedding, we do e_c = E @ o_c for the context word and, similarly, θ_t = Θ @ o_t for the target word.

Note that the order of the variables and the operator differ from your post; the above is how I would write them.
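
Here is a minimal NumPy sketch of this point, assuming toy sizes and the skip-gram softmax p(t | c) = exp(θ_t · e_c) / Σ_j exp(θ_j · e_c) from the lectures (the sizes and word index are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_features = 10, 4                         # toy sizes for illustration

# Two separate parameter sets, each initialized randomly.
E = rng.standard_normal((n_features, vocab_size))      # context word embeddings
Theta = rng.standard_normal((n_features, vocab_size))  # target word embeddings

w = 3                              # suppose context word == target word (e.g., "Noah")
o = np.zeros(vocab_size)
o[w] = 1.0                         # one-hot vector for word w

e_c = E @ o                        # context embedding of word w
theta_t = Theta @ o                # target embedding of the same word w
print(np.allclose(e_c, theta_t))  # False: different parameters, different vectors

# Skip-gram softmax: p(t | c) = exp(theta_t . e_c) / sum_j exp(theta_j . e_c)
logits = Theta.T @ e_c             # one score per candidate target word
p = np.exp(logits) / np.exp(logits).sum()
print(p[w])                        # probability of the target word given the context
```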

Cheers,
Raymond


Thank you so much for your clarification, and apologies for my late reply. I have one more question, please: are the final embedding values that we use afterwards considered to be E or Θ?

Hello @Yasmeen_Asaad_Azazi,

If you just need one set of embeddings for your words, then the target word embeddings can be your first choice.

Moreover, some studies have found that aggregating the two can do better. For example, GloVe, which was published after Word2Vec and is also covered in this course, mentions the following in its paper:


[Image: excerpt from the GloVe paper on combining the two sets of word vectors]

You see, under certain constraints, we can aggregate the two and do better on certain tasks.

The GloVe lecture also covers this (see the green formula), except that it takes the average instead of the sum used in the paper.
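
As a rough sketch of that aggregation (the matrices below are random stand-ins for the trained E and Θ):

```python
import numpy as np

# Stand-ins for trained parameters; in practice E and Theta come out of training,
# both with shape (n_features, vocab_size).
E = np.random.randn(4, 10)      # trained context word embeddings
Theta = np.random.randn(4, 10)  # trained target word embeddings

e_final_avg = (E + Theta) / 2.0  # the lecture's version: average the two sets
e_final_sum = E + Theta          # the GloVe paper's version: sum the two sets
```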

As for which one is better in your case, you will need to test them to find out.

Cheers,
Raymond


Thanks a lot for the explanation! 🙏✨


You are welcome, @Yasmeen_Asaad_Azazi!
