In the Word2Vec lecture of NLP and Word Embeddings in Course 5, I understand that theta is actually the parameters (weights) associated with the dense layer followed by the softmax. However, given the conventions in the previous courses, why is it transposed?
It’s an implementation detail that Andrew often includes when he’s using “theta” in the notation.
He assumes that all vectors are column vectors of size (n x 1).
So in order to compute the dot product of two such vectors, the first one needs to be transposed, so its size becomes (1 x n).
Then the dot product dimensions are (1 x n) * (n x 1), which gives a scalar result.
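Here is a quick NumPy sketch of that shape bookkeeping, in case it helps. The names theta_t and e_c, the size n = 4, and the random values are just my assumptions for illustration, they are not taken from the course notebooks.

```python
import numpy as np

n = 4  # assumed embedding size, chosen only for this illustration
rng = np.random.default_rng(0)

# Both vectors are stored as column vectors of shape (n, 1).
theta_t = rng.standard_normal((n, 1))  # hypothetical softmax weights for one output word
e_c = rng.standard_normal((n, 1))      # hypothetical embedding of the context word

# Transposing the first vector gives (1, n), so the product is (1, n) @ (n, 1) -> (1, 1).
score = theta_t.T @ e_c
print(theta_t.T.shape, e_c.shape, score.shape)  # (1, 4) (4, 1) (1, 1)

# The single entry equals the ordinary dot product of the flattened vectors.
flat = np.dot(theta_t.ravel(), e_c.ravel())
print(np.isclose(score.item(), flat))  # True
```

Without the transpose, theta_t @ e_c would be (n, 1) @ (n, 1), which NumPy rejects with a shape mismatch error when n is greater than 1.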