DLS5 W2 Learning Word embeddings

Just looking for a big picture of how the embedding matrix is computed; maybe I missed something in the week 2 lectures. The lectures did cover various underlying concepts (word representation, featurized representation as word embeddings, how to use word embeddings/transfer learning, word embedding matrix E, Skip-Gram model).

Somehow I am missing how the entries of matrix E are computed. I can see the supervised learning (SL) training task, but that task seems to be mostly about estimating the parameters of the last layer (softmax or logistic classifiers). It is not clear to me how the training takes care of matrix E.


Hi @dds ,

This is my ‘big picture’ of how the embedding matrix is computed:

  1. The neural network used to calculate E is actually a very shallow one: it has one hidden layer and an output layer, and the output layer is a softmax.

  2. The model is fed a very large amount of text (a corpus). The way you feed these words in was explained in the lectures, specifically in the Skip-Gram lecture. Another possible way is called CBOW (Continuous Bag of Words). Let's assume that this is clear.

  3. And here’s the trick: while the model is being trained, the W matrix of the single hidden layer is continuously updated by backpropagation. At the end of training, the W matrix of the hidden layer is our Embedding Matrix E!

And that’s it!
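To make the three steps concrete, here is a minimal NumPy sketch of that shallow network. It is only an illustration under my own assumptions: the tiny vocabulary, the made-up (context, target) index pairs, the learning rate, and the names `E`, `Theta`, and `pairs` are all hypothetical, and it uses a plain full softmax rather than anything optimized.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (illustrative assumptions, not values from the course).
vocab_size, n_features = 6, 3

# Hidden-layer weights: this is the matrix that becomes the embedding matrix E.
E = rng.normal(scale=0.1, size=(n_features, vocab_size))
# Output (softmax) layer weights, one row per vocabulary word.
Theta = rng.normal(scale=0.1, size=(vocab_size, n_features))

# Made-up (context index, target index) training pairs.
pairs = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 2)]
E_initial = E.copy()
lr = 0.1


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()


def mean_loss():
    # Average cross-entropy over all training pairs.
    return float(np.mean([-np.log(softmax(Theta @ E[:, c])[t]) for c, t in pairs]))


loss_before = mean_loss()
for epoch in range(300):
    for c, t in pairs:
        e_c = E[:, c].copy()         # hidden-layer output = column c of E
        p = softmax(Theta @ e_c)     # softmax output layer
        dz = p.copy()
        dz[t] -= 1.0                 # gradient of cross-entropy w.r.t. the logits
        grad_e = Theta.T @ dz        # gradient that flows back into column c of E
        Theta -= lr * np.outer(dz, e_c)  # update the softmax layer
        E[:, c] -= lr * grad_e           # update the embedding column as well
loss_after = mean_loss()

print("loss before:", loss_before, "loss after:", loss_after)
```

The point to notice is the last line of the inner loop: backprop does not stop at the softmax layer, so the columns of `E` are trained together with `Theta`.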

This is, again, the ‘big picture’ of how the embedding matrix is computed.

Let me know if you’d like to double click on any of these points.



I wanted to add something else to item #3 above:

One may ask: how is it that the W matrix of the hidden layer learns values that represent the semantics of the target words?

In training, we are feeding the model with context words and a target word (or with a target word and negative samples, depending on the chosen method). For this explanation, let's go with the model of context words + target word, where the context words can be, for example, the 4 words before and the 4 words after the target word.

So, if different target words have similar context words, we would expect the model's outputs to be similar in those cases. And if the outputs are similar, the corresponding W values (updated during backprop) also end up similar for these target words.

Another important detail: the hidden layer is a fully-connected layer, and the number of neurons is usually set between 100 and 300 (sometimes more, sometimes fewer), with each neuron representing a ‘feature’. So the W matrix of the hidden layer has shape (features, vocab_size). For 10,000 words and 300 features, W has shape (300, 10,000).
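A quick shape check may help here. This is just a sketch with random numbers standing in for trained weights, and `word_index` is an arbitrary made-up index: multiplying W by a one-hot vector picks out one column of W, and that column is the word's 300-dimensional embedding.

```python
import numpy as np

vocab_size, n_features = 10_000, 300
rng = np.random.default_rng(0)

# Random stand-in for the trained hidden-layer weights.
W = rng.normal(size=(n_features, vocab_size))

word_index = 4321              # hypothetical position of some word in the vocabulary
o = np.zeros(vocab_size)       # one-hot vector for that word
o[word_index] = 1.0

e_via_matmul = W @ o               # what the hidden layer computes
e_via_lookup = W[:, word_index]    # same result: just read column word_index

print(W.shape, e_via_matmul.shape)
```

In practice the column lookup is what gets used, since multiplying by a one-hot vector wastes work on 9,999 zeros.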

@Juan_Olano Thank you so much. I think I get the outline of the process.

Thanks for pointing out CBOW. I must have missed it if it was covered in the lectures. In fact, from the original paper (Mikolov et al.) it seems that CBOW outputs a word given context words as input (Skip-Gram goes the other way). So it seems that for analogies or sentence completion, CBOW would be the answer. (Fig. 1 of https://arxiv.org/pdf/1301.3781.pdf)

I will read some more and google some more. I will reach out if there are questions. My goal is just to have a high-level picture of what goes on.

Thanks for the feedback.

CBOW is very briefly mentioned by Prof. Ng in the Word2Vec lecture, and I think it is worth understanding it as well, along with the Skip-Gram technique.

As you say, CBOW outputs a word given a context while Skip-Gram outputs a context given a word. At the end of the day, though, I think the central point of the discussion is ‘where is the embedding matrix generated’, and this is, as shared, the W matrix of the hidden layer.
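To see the difference in how the two methods feed words to the model, here is a small sketch of how training examples could be built from a sentence. The function names and the window size are my own choices for illustration. Also note that this version lists every word in the window, whereas Skip-Gram as presented in the lecture samples a target at random from within the window.

```python
def skipgram_pairs(tokens, window):
    """Skip-Gram: input is the center word, output is one nearby word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window):
    """CBOW: input is the list of surrounding words, output is the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs


tokens = "i want a glass of orange juice".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_pairs(tokens, window=1)

print(sg[:3])
print(cb[:3])
```

Either way, the input words go through the same kind of hidden layer, so the embedding matrix comes from the same place in both setups.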

Google will provide a wealth of links that discuss Word2Vec - if you find any new insight, please share it!



As Juan_Olano pointed out, W can be a matrix of shape [No. of features x Vocab size], where each feature represents what a single neuron has learned. If I understand correctly, attributes such as ‘fruit’, ‘alive’, and ‘royal’ are only indicative, and in practice we don't need to define them. The model learns these attributes for every word as a column vector of W.
Please correct me if my understanding is wrong.

Your understanding is correct. We don’t have to define these attributes. Instead, they are learned by the model while you train a word embedding.

Hello, I also have one question: in the computation of the softmax p(t|c) in Word2Vec, what is the dimension of 𝜃_t?
I saw it in the quiz but I can’t find it mentioned in the Word2Vec video.
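
One way to reason about the dimension: in the softmax, 𝜃_t appears in a dot product with the context embedding e_c, so it has to have the same length as e_c, i.e. the number of embedding features. The sketch below assumes 300 features and a 10,000-word vocabulary (the sizes used earlier in this thread), with random values and made-up indices `c` and `t` standing in for real ones.

```python
import numpy as np

vocab_size, n_features = 10_000, 300
rng = np.random.default_rng(0)

# Random stand-ins for trained parameters.
E = rng.normal(scale=0.01, size=(n_features, vocab_size))      # embedding matrix
Theta = rng.normal(scale=0.01, size=(vocab_size, n_features))  # one theta row per output word

c, t = 12, 345            # hypothetical context and target word indices
e_c = E[:, c]             # embedding of the context word, length 300
theta_t = Theta[t]        # parameter vector for target word t, also length 300

logits = Theta @ e_c      # theta_j . e_c for every word j in the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()      # softmax over the whole vocabulary
p_t_given_c = probs[t]    # p(t | c)

print(theta_t.shape, e_c.shape, p_t_given_c)
```

So with 300-dimensional embeddings, each 𝜃_t is a 300-dimensional vector, and there is one such vector per word in the vocabulary.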