Other than having the correct dimensions, I’m confused about why you could use the W2 matrix from CBOW as your embedding matrix. Since the transpose of a matrix is not its inverse, what relationship would the W2 matrix have to transforming one-hot vectors into embeddings, given that it was trained to go in the other direction?
Hi @davidpet
For the same reason you could use the W1 matrix.
Word embeddings are not the one “correct” solution to some equation with a single right answer. They are a useful tool for achieving your goals: primarily, by reducing the sparsity of the one-hot approach to a smaller, manageable number of dimensions (the embedding space), you can train a model that can predict sentiment, translate languages, or do whatever your goal is.
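For instance, here is a minimal sketch (not the assignment’s exact code; `V`, `N` and the weights are just placeholders, using the same `(N, V)` / `(V, N)` shape convention as the `.rand` calls below) of the usual ways to read embeddings out of a trained CBOW model:

```python
import numpy as np

# Stand-ins for trained CBOW weights:
# V = vocabulary size, N = embedding dimension.
V, N = 1000, 50
W1 = np.random.rand(N, V)          # would be the trained W1 in practice
W2 = np.random.rand(V, N)          # would be the trained W2 in practice

# Option 1: the columns of W1 are the embeddings (W1 @ one_hot picks a column).
embeddings_from_W1 = W1.T          # shape (V, N): one row per word

# Option 2: the rows of W2 are the embeddings.
embeddings_from_W2 = W2            # shape (V, N): one row per word

# Option 3: average the two views.
embeddings_avg = (W1.T + W2) / 2   # shape (V, N)
```

Any of the three works as an embedding matrix; they just give you somewhat different vectors.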
Not only that, but there is also the application of ReLU in between. As I said, the goal is not to get back to W1 or x.
The only relationship between W1 and W2 (and don’t forget b1 and b2) is through the cost function and the data.
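To make that concrete, here is a rough sketch of the CBOW forward pass (hypothetical sizes, everything randomly initialized); x only ever flows forward through W1, the ReLU and W2, so nothing in the model asks W2 to invert W1:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

# Hypothetical sizes and randomly initialized parameters.
V, N = 1000, 50
W1, b1 = np.random.rand(N, V), np.random.rand(N, 1)
W2, b2 = np.random.rand(V, N), np.random.rand(V, 1)

# In CBOW, x is the averaged one-hot vector of the context words;
# a single one-hot vector is used here just to keep the sketch short.
x = np.zeros((V, 1))
x[42] = 1.0

h = relu(W1 @ x + b1)           # hidden layer, shape (N, 1)
y_hat = softmax(W2 @ h + b2)    # predicted distribution over the vocabulary, shape (V, 1)

# Training nudges y_hat toward the target word via the cost function;
# at no point does it require W2 to undo what W1 (plus the ReLU) did to x.
```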
Changing the cost function would get you different weights (W1, W2, …). Having different weights because of random initialization (like `.rand(N,V)` and `.rand(V,N)`), different random batches or different data altogether, and other factors would also result in different word embeddings.
They would all be “wrong” and “correct” at the same time - the only thing that matters is your goal. Which of them helps you predict sentiment or translate languages best?
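And just to illustrate the random-initialization point above (hypothetical sizes and seeds):

```python
import numpy as np

V, N = 1000, 50

rng_a = np.random.default_rng(0)
rng_b = np.random.default_rng(1)

# Two random initializations of the same architecture.
W1_a, W2_a = rng_a.random((N, V)), rng_a.random((V, N))
W1_b, W2_b = rng_b.random((N, V)), rng_b.random((V, N))

# Different starting points (plus different batches, data, cost, ...)
# lead to different, yet equally usable, embeddings after training.
print(np.allclose(W1_a, W1_b))   # False
```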
I’m not sure what you mean. Could you elaborate?
Makes sense - thanks!