From my understanding, in word2vec we are learning a mapping between a word and its context. The network consists of a single hidden layer followed by a softmax output layer, so we have two sets of weights: one for the hidden layer (W1 in the image below) and one for the softmax layer (W2 in the image below). I'm assuming that W1 is the weight matrix we want to keep as the final embedding matrix, and that we are discarding the W2 weights. Is this correct? And if we are instead predicting a word from its context, which weights would we keep?
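To make my mental model concrete, here is roughly what I'm picturing (just a NumPy sketch with made-up sizes and names, not actual word2vec training code):

```python
import numpy as np

vocab_size, embed_dim = 10000, 300   # placeholder sizes

# W1: input word -> hidden (embedding) layer, W2: hidden -> softmax scores
W1 = np.random.randn(vocab_size, embed_dim) * 0.01
W2 = np.random.randn(embed_dim, vocab_size) * 0.01

def forward(word_idx):
    """One forward pass: look up the word, score every vocab word, softmax."""
    h = W1[word_idx]                  # hidden layer = the word's row of W1
    scores = h @ W2                   # one score per word in the vocabulary
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()            # probability of each context word
```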
If you have an input layer and a hidden layer and an output layer, you have two weight matrices. Both are vitally important.
@TMosh Yes, we would keep both weight matrices if we want to keep predicting the context given a word. But if we wanted to transfer the embeddings for use with another problem, or maybe just visualize similarity with t-SNE, wouldn't I just need the first weight matrix?
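For the t-SNE case, something like this is what I have in mind (a sketch; `embeddings.npy` is just a stand-in for wherever the learned W1 ends up being saved):

```python
import numpy as np
from sklearn.manifold import TSNE

# Assuming W1 has shape (vocab_size, embed_dim): each row is one word's embedding.
W1 = np.load("embeddings.npy")        # hypothetical file holding the learned W1
coords_2d = TSNE(n_components=2, perplexity=30).fit_transform(W1)
# coords_2d has shape (vocab_size, 2) and can be scattered with matplotlib.
```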
Hey @Max_Rivera,
I would define the aim of Word2Vec a bit differently. In my opinion, it should be defined as learning a mapping between words and their embeddings (say, 300-dimensional vectors). Learning to predict a word from its context is just the task we exploit to learn these embeddings. Once we have the embeddings, the trained model isn't of much significance to us, at least as far as our aim goes. In fact, you can see this just by breaking down the name: Word2Vec = Word → Vectors (or Embeddings).
And yes, if you want to obtain the embeddings (which, once again, is the primary aim), you will need only the first weight matrix. I hope this helps.
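For instance, once training is done, something along these lines is all you need for reusing or comparing the embeddings (a rough sketch; W1 and the word_to_idx mapping are assumed to come from your trained model):

```python
import numpy as np

# Assume W1 (vocab_size x embed_dim) and word_to_idx come from the trained model.
def embedding(word, W1, word_to_idx):
    """The embedding of a word is just its row of the first weight matrix."""
    return W1[word_to_idx[word]]

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# e.g. compare two words, or stack rows of W1 and hand them to t-SNE:
# sim = cosine_similarity(embedding("king", W1, word_to_idx),
#                         embedding("queen", W1, word_to_idx))
# Note that W2 is never needed for any of this.
```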
Cheers,
Elemento