Week 2 - Learning Word Embeddings

In the Learning Word Embeddings video, Andrew suggests that we need to build a language model using a neural network architecture in order to learn word embeddings. However, in the Week 1 third video (Recurrent Neural Network), the whole discussion is about why we need to use RNNs for sequence-based applications such as language models, and why it's not a good idea to use naive standard neural nets. I am a bit confused here. Why do we use a plain neural net to learn word embeddings and not an RNN? For example, as suggested in the video, suppose we choose the previous 4 words as context, "a glass of orange", and feed them as a feature vector (4×300) to a hidden layer and then a softmax layer to predict the target word "juice". How can the model learn to predict "juice" when we are not maintaining the temporal information of the context (the sequence order seems gone here, because we feed the whole feature vector to a single hidden layer)?

Welcome to the community!

That is a good question.

You may have heard of n-grams, sequences of n words. These are sometimes used for word prediction, i.e., predicting the n-th word from words 1 through (n-1). In the early 2000s, Bengio published a paper, A Neural Probabilistic Language Model, that uses a neural network for word prediction. Later, Arisoy applied DNNs to the same task; see the paper Deep Neural Network Language Models. These are quite similar to what you proposed: applying a feedforward neural network to predict the n-th word. So it is actually a good starting point.
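To see why word order is not actually lost in this fixed-window setup, here is a minimal, untrained numpy sketch of a Bengio-style feedforward language model. The dimensions are made up for illustration (a 5-word vocabulary and 8-dimensional embeddings instead of 300), and the weights are random rather than learned:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["a", "glass", "of", "orange", "juice"]
V, D, H, N = len(vocab), 8, 16, 4   # vocab size, embedding dim, hidden units, context length

E = rng.normal(scale=0.1, size=(V, D))       # embedding matrix (learned during training)
W1 = rng.normal(scale=0.1, size=(N * D, H))  # hidden layer weights
b1 = np.zeros(H)
W2 = rng.normal(scale=0.1, size=(H, V))      # softmax layer weights
b2 = np.zeros(V)

def predict_next(context):
    """Forward pass: concatenate the N context embeddings, one hidden layer, softmax."""
    idx = [vocab.index(w) for w in context]
    x = E[idx].reshape(-1)            # (N*D,): each word occupies a fixed slice by position
    h = np.tanh(x @ W1 + b1)
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max())  # numerically stable softmax
    return p / p.sum()

p = predict_next(["a", "glass", "of", "orange"])
```

The key point: because each context position occupies its own slice of the concatenated vector, each position multiplies its own rows of W1. Feeding "orange of glass a" produces different activations than "a glass of orange", so the order *is* encoded implicitly by position. What the model cannot do is handle a context longer than the fixed window N.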

Then many researchers started to think about longer sentences. Here is a summary by Mikolov in his paper, Recurrent neural network based language model:

A major deficiency of Bengio’s approach is that a feedforward network has to use fixed length context that needs to be specified ad hoc before training. Usually this means that neural networks see only five to ten preceding words when predicting the next one. It is well known that humans can exploit longer context with great success. Also, cache models provide complementary information to neural network models, so it is natural to think about a model that would encode temporal information implicitly for contexts with arbitrary lengths.


Recurrent neural networks do not use limited size of context. By using recurrent connections, information can cycle inside these networks for arbitrarily long time.

RNNs became the center of this research, but they still had a problem: even for an RNN, keeping the context of a long sentence was difficult in practice. Then, LSTM was introduced.

Later in this course, you will learn about "attention" and the "transformer". Recent work in this area is actually transformer-based, not RNN-based; this is another major step up for language models. The problem is that there are billions of parameters to learn, which requires huge computational power. So we usually take a pre-trained model and fine-tune it on our target domain with a small corpus.

You are on the right track. Enjoy learning!


Thank you so much for your thorough explanation.