Could someone explain why theta of a given word is not expected to be close to the word embedding vector of that same word after training? My intuition says they should be identical…

Does this have to do with cosine similarity, which is computed by a fixed formula? The word embedding, on the other hand, is a latent-space representation generated by a model, so how could the two ever be the same?

Because a target word t usually does *not* also appear in its own context, theta_t is trained to have a high dot product with the embeddings e_c of words c that occur *around* t, not with e_t itself. So the dot product between theta_t and e_c for c = t should produce a small score that yields a low softmax probability p(t|c), and there is no reason for theta_t and e_t to end up close.
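You can see this with a minimal numpy sketch of skip-gram with a full softmax (my own toy setup, not the original word2vec code: `E` holds the input/word embeddings e, `theta` holds the output/context vectors). Training pulls `theta[t]` toward the embeddings of t's neighbors, not toward `E[t]`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus; note a word is rarely its own neighbor, so theta_t is
# fit against the embeddings of surrounding words, not against e_t.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8

E = rng.normal(scale=0.1, size=(V, D))      # input ("word") embeddings e
theta = rng.normal(scale=0.1, size=(V, D))  # output ("context") vectors theta

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

# (center, context) pairs with a window of 1
pairs = [(idx[corpus[i]], idx[corpus[j]])
         for i in range(len(corpus))
         for j in (i - 1, i + 1) if 0 <= j < len(corpus)]

lr = 0.1
losses = []
for _ in range(200):
    total = 0.0
    for c, t in pairs:                 # predict context word t from center c
        p = softmax(theta @ E[c])      # p(.|c) over the vocabulary
        total += -np.log(p[t])
        grad = p.copy()
        grad[t] -= 1.0                 # d loss / d scores
        g_E = theta.T @ grad           # compute both grads from the same
        g_theta = np.outer(grad, E[c]) # forward pass, then update
        E[c] -= lr * g_E
        theta -= lr * g_theta
    losses.append(total / len(pairs))
```

After training, the cross-entropy drops, yet `E[w]` and `theta[w]` for the same word w are two different vectors that were optimized for different roles.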