C5W2: Word2Vec. Softmax (difference between e_c and theta_t)

Hello everyone,
I have a question about the softmax formulation parameters (e_c, θ_t).
Here’s how I currently understand it:

  1. e_c is the context word embedding vector and is computed as: o_c * E,
    where o_c is a one-hot vector for the context word (see the sketch after this list).

  2. The E matrix holds the learnable parameters of the model and will be updated on each
    batch/mini-batch; its shape is (number_of_features, vocab_size).

  3. Now, my question is about θ_t:
    Is θ_t the output embedding vector corresponding to word t, just like e_c is the
    embedding vector for the context word? That is, θ_t = o_t * E?
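
For concreteness, here is a minimal NumPy sketch of the lookup from items 1 and 2 (the vocabulary size, feature count, and word index are made up for illustration; note that with E of shape (n_features, vocab_size), the product has to be written E @ o_c):

```python
import numpy as np

vocab_size, n_features = 10, 4               # toy sizes for illustration
E = np.random.randn(n_features, vocab_size)  # learnable embedding matrix

c = 3                                        # hypothetical index of the context word
o_c = np.zeros(vocab_size)
o_c[c] = 1.0                                 # one-hot vector for the context word

e_c = E @ o_c                                # selects column c of E
assert np.allclose(e_c, E[:, c])             # the product is just a column lookup
```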

To clarify further with an example:
If the context word and the target word are the same, e.g., context = "Noah" and target = "Noah", does that mean e_c == θ_t?

Thanks in advance!

Hello, @Yasmeen_Asaad_Azazi,

Target word embeddings and context word embeddings are two different sets of embeddings, and both are initialized randomly, so e_c is not equal to θ_t, even when the context and target words are the same.

To take an embedding, we do e_c = E @ o_c for the context word and, similarly, θ_t = Θ @ o_t for the target word.

Note that the order of the variables and the operator differ from your post; the above is how I would write them.
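
Here is a minimal NumPy sketch of this point, assuming toy sizes and the skip-gram softmax p(t | c) = exp(θ_t · e_c) / Σ_j exp(θ_j · e_c) from the lectures (the sizes and word index are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_features = 10, 4                         # toy sizes for illustration

# Two separate parameter sets, each initialized randomly.
E = rng.standard_normal((n_features, vocab_size))      # context word embeddings
Theta = rng.standard_normal((n_features, vocab_size))  # target word embeddings

w = 3                              # suppose context word == target word (e.g., "Noah")
o = np.zeros(vocab_size)
o[w] = 1.0                         # one-hot vector for word w

e_c = E @ o                        # context embedding of word w
theta_t = Theta @ o                # target embedding of the same word w
print(np.allclose(e_c, theta_t))  # False: different parameters, different vectors

# Skip-gram softmax: p(t | c) = exp(theta_t . e_c) / sum_j exp(theta_j . e_c)
logits = Theta.T @ e_c             # one score per candidate target word
p = np.exp(logits) / np.exp(logits).sum()
print(p[w])                        # probability of the target word given the context
```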

Cheers,
Raymond


Thank you so much for your clarification, and apologies for my late reply. I have one more question, please: are the final embedding values that we use afterwards considered to be E or Θ?

Hello @Yasmeen_Asaad_Azazi,

If you just need one set of embeddings for your words, then the target word embeddings can be your first choice.

Moreover, some studies have found that aggregating the two can do better. For example, GloVe, which was published after Word2Vec and is also covered in this course, mentions the following in its paper:


[Image: excerpt from the GloVe paper on combining the two sets of word vectors]

You see, under certain constraints, we can aggregate the two and do better on certain tasks.

The GloVe lecture also covers this (see the green formula), except that it takes the average instead of the sum used in the paper.
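
As a rough sketch of that aggregation (the matrices below are random stand-ins for the trained E and Θ):

```python
import numpy as np

# Stand-ins for trained parameters; in practice E and Theta come out of training,
# both with shape (n_features, vocab_size).
E = np.random.randn(4, 10)      # trained context word embeddings
Theta = np.random.randn(4, 10)  # trained target word embeddings

e_final_avg = (E + Theta) / 2.0  # the lecture's version: average the two sets
e_final_sum = E + Theta          # the GloVe paper's version: sum the two sets
```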

As for which one is better in your case, you will need to test them to find out.

Cheers,
Raymond


Thanks a lot for the explanation! 🙏✨


You are welcome, @Yasmeen_Asaad_Azazi!
