Word2Vec theta matrix

Hello @jourdelune863,

You have a vocab of 10,000 words. When you compute p(t | c), c is the context word and t is the target word.

You take the context word vector \theta_c out, you take the target word vector \theta_t out, and you operate on them. The index j in \theta_j runs from 1 to 10,000, because you have 10,000 words. The softmax equation

p(t | c) = \frac{e^{\theta_t^\top \theta_c}}{\sum_{j=1}^{10,000} e^{\theta_j^\top \theta_c}}

says that the probability p(t | c) is the (exponentiated) context-target product divided by the sum of \theta_c's products with every word vector in the vocabulary.
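If it helps, here is a minimal NumPy sketch of that computation. It follows the notation above, where the same \theta vectors play both roles; the embedding dimension of 300 is my pick for illustration, only the 10,000-word vocab comes from the exercise:

```python
import numpy as np

vocab_size, emb_dim = 10_000, 300   # 300 is an assumed dimension, not from the exercise

# One trainable vector per word, initialized randomly
theta = np.random.randn(vocab_size, emb_dim) * 0.01

def p_target_given_context(t, c):
    """Softmax probability p(t | c) over all 10,000 words."""
    logits = theta @ theta[c]        # theta_j . theta_c for every j = 1..10,000
    logits -= logits.max()           # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp[t] / exp.sum()        # e^{theta_t . theta_c} / sum_j e^{theta_j . theta_c}
```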

The vectors are trainable parameters that are tuned by gradient descent: you initialize them randomly and let training adjust them.
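To make "tuned by gradient descent" concrete, here is a hedged sketch of one update on the loss -log p(t | c). I'm assuming the more common setup where the context embedding and the target parameters live in separate matrices (E and Theta below, both vocab_size by emb_dim); the learning rate is arbitrary:

```python
import numpy as np

def sgd_step(E, Theta, c, t, lr=0.1):
    """One gradient-descent update on the loss -log p(t | c)."""
    e_c = E[c].copy()                  # copy so the Theta update uses the pre-step value
    logits = Theta @ e_c
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = probs.copy()
    grad[t] -= 1.0                     # d(loss)/d(logits) = probs - one_hot(t)
    E[c]  -= lr * (Theta.T @ grad)     # tune the context embedding
    Theta -= lr * np.outer(grad, e_c)  # tune every theta_j row at once
```

Run this over many (context, target) pairs sampled from the corpus and both matrices settle into useful word vectors.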

Raymond
