Hello @jourdelune863,
You have a vocabulary of 10,000 words. When you compute p(t | c), you have a context word c and a target word t.
You take the context word vector \theta_c out, you take the target word vector \theta_t out, and you operate on them. The index j in \theta_j iterates from 1 to 10,000, because you have 10,000 words. The softmax equation is saying that the probability p(t | c) is the exponentiated context-target dot product divided by the sum of the same kind of exponentiated product over all 10,000 candidate words:

p(t \mid c) = \frac{e^{\theta_t^\top \theta_c}}{\sum_{j=1}^{10{,}000} e^{\theta_j^\top \theta_c}}
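Here is a minimal NumPy sketch of that computation. The embedding dimension (300) and the use of a single parameter matrix theta for every word are my own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 10_000  # 10,000 words in the vocabulary
embed_dim = 300      # assumed embedding dimension

# theta[j] is the trainable vector of word j, randomly initialized
theta = rng.normal(scale=0.01, size=(vocab_size, embed_dim))

def p_t_given_c(t: int, c: int) -> float:
    """Softmax probability p(t | c) built from dot products theta_j . theta_c."""
    logits = theta @ theta[c]  # one dot product per candidate word j
    logits -= logits.max()     # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp[t] / exp.sum()  # e^{theta_t . theta_c} / sum_j e^{theta_j . theta_c}
```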
The vectors are trainable parameters that are tuned by gradient descent: you initialize them randomly and let training adjust them.
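If it helps, here is a hand-rolled sketch of one gradient-descent step on the cross-entropy loss -log p(t | c), continuing the code above. The learning rate and the plain-SGD setup are assumptions; in practice a framework's autograd computes these gradients for you:

```python
def sgd_step(t: int, c: int, lr: float = 0.01) -> None:
    """One SGD step on the loss -log p(t | c)."""
    logits = theta @ theta[c]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # d(-log p)/d(logits_j) = probs_j - 1{j == t}
    grad_logits = probs.copy()
    grad_logits[t] -= 1.0

    # logits_j = theta_j . theta_c, so the gradient flows into both factors
    grad_rows = np.outer(grad_logits, theta[c])  # gradient for every theta_j
    grad_ctx = theta.T @ grad_logits             # extra gradient for theta_c

    theta[:] -= lr * grad_rows  # update all target-side vectors
    theta[c] -= lr * grad_ctx   # update the context vector
```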
Raymond