This question has clearly been puzzling a few people, as evidenced by the number of closely related posts:
Despite reading all those posts and the two papers (Bengio et al., 2003; Mikolov et al., 2013), I'm still confused. This softmax expression from the lectures doesn't appear anywhere in the papers:
p(t|c) = \frac{e^{{\theta_t}^T e_c}}{\sum_{j=1}^{10,000} e^{{\theta_j}^T e_c}}
(Note that I'll stick to the notation from the lecture notes, and that e has different meanings depending on context: it is the exponential base in the softmax, but an embedding vector when subscripted, as in e_c.)
The lecture notes, and some comments in this forum, suggest that \theta_t is a vector that depends on the chosen target word t. Separately, \theta is commonly used in the literature to refer to the model parameters in a general sense, and other comments in this forum have suggested that it refers to the weights of … presumably some hidden layer (à la Bengio et al., 2003). The problem is that these two statements cannot both be true. Either \theta_t is a vector derived from the target word t through some unexplained mechanism, or it is a set of parameters; in the latter case it would usually be written as a weight matrix W.
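To make the ambiguity concrete, here is a minimal NumPy sketch of the lecture's softmax under the assumption that \theta is a 10,000 × n parameter matrix whose t-th row is \theta_t. That matrix, its shape, and all the names below are my assumptions, not anything stated in the lectures:

```python
import numpy as np

vocab_size, embed_dim = 10_000, 300
rng = np.random.default_rng(0)

E = rng.normal(size=(vocab_size, embed_dim))      # embedding matrix, rows are e_j
theta = rng.normal(size=(vocab_size, embed_dim))  # my assumption: theta_t = theta[t]

def p_t_given_c(t, c):
    e_c = E[c]                          # embedding of the context word c
    logits = theta @ e_c                # theta_j^T e_c for every j = 1..10,000
    logits -= logits.max()              # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[t]
```

I can't tell from the lectures whether this is the intended reading.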
Furthermore, the skip-gram implementation in Mikolov et al (2013), which the Word2Vec lectures are supposedly based on, explicitly drops the hidden layer. So there’s no \theta in the sense of network parameters other than the embedding matrix E.
I would ask, please, that one of those Clarification slides be added to the training material once this is all explained, especially given that the assignment depends on understanding this stuff.
The TensorFlow tutorial on word2vec offers an alternative (word2vec | Text | TensorFlow). Translating the equation there into the notation of our lecture notes we’d have:
p(t|c) = \frac{e^{{e_t}^T e_c}}{\sum_{j=1}^{10,000} e^{{e_j}^T e_c}}
So this at least agrees with some of the commentary, but it doesn't explain why Andrew would use \theta_t in place of e_t. Furthermore, it isn't consistent with the network model: it would require the embedding layer to be run against both the context and target words, whereas the diagrams in the lecture notes suggest that only the context word is embedded.
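For comparison, here is the same sketch rewritten for the TensorFlow-tutorial form, where the only parameters are the embedding matrix E itself and both the target and context words are looked up in it. Again, this is just my reading of the equation above, not a confirmed description of what the lectures intend:

```python
import numpy as np

vocab_size, embed_dim = 10_000, 300
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))   # the only parameter matrix: the embeddings

def p_t_given_c(t, c):
    scores = E @ E[c]                  # e_j^T e_c for every j: target AND context both use E
    scores -= scores.max()             # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[t]
```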
So, can someone please explain this clearly?