Understanding Word Embeddings


I am going through week 2 of the last course in the Deep Learning specialization and I am trying to understand word embeddings. Andrew first shows a “feature matrix” as an alternative to the one hot encoding, where the first dimension contains a set of 300 features that he comes up with and the second dimension contains the 10,000 word vocabulary.

After that, Andrew just refers to this as an embedding matrix E which is multiplied by the one hot vectors.

My confusion is regarding the 300 features Andrew pulled out of a hat. How do we come up with those? When he talks about “learning” the embeddings, is he referring to learning which words need to be in those features or learning the weights of those features for each word in the vocabulary? Or both?

Are word embeddings the 300 x 1 vector containing the weights, or are the embeddings the actual words along the first dimension of that vector?

A word embedding is a vector of learned weights. The embedding matrix can be thought of as a 2D array of numbers with shape (#words, embedding dimension). Since the vocabulary is made of strings, each word in the vocabulary is assigned an integer index into the embedding matrix, which serves as a lookup table.
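A minimal sketch of that lookup, using a hypothetical toy vocabulary and a randomly initialized matrix (in a real model these weights would be learned during training):

```python
import numpy as np

# Hypothetical toy setup: 4-word vocabulary, embedding dimension 3.
vocab = ["the", "cat", "sat", "mat"]
word_to_index = {w: i for i, w in enumerate(vocab)}

# Embedding matrix of shape (#words, embedding dimension).
# Random here for illustration; learned in practice.
E = np.random.randn(len(vocab), 3)

# Looking up a word's embedding is just row indexing.
cat_embedding = E[word_to_index["cat"]]  # shape (3,)
```

So the strings never enter the model directly; only the integer indices (or their one-hot equivalents) do.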

The dimension of a word embedding requires experimentation. If the feature vectors are too small, words are likely to end up too close together in embedding space, and separating them well is likely to require a lot of training. If the embedding dimension is too high, the compute and memory requirements for training the word embeddings are likely to be high.

Learning word embeddings involves 2 things:

  1. Deciding on the vocabulary: Usually, we take the top K most frequently occurring words from the training dataset, or use a measure like TF-IDF, and treat the rare words as unknown tokens. The idea behind this approach is that uncommon token types like names and phone numbers are less informative than the core vocabulary. Pick an approach that works best for your problem.
  2. Picking the embedding dimension: Once the vocabulary is decided, we pick the embedding dimension and learn the embeddings at training time.
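Step 1 above can be sketched as follows; `build_vocab` and the `<UNK>` token name are illustrative choices, not a fixed API:

```python
from collections import Counter

def build_vocab(tokens, top_k):
    """Keep the top_k most frequent tokens; all other words map to <UNK>."""
    counts = Counter(tokens)
    vocab = ["<UNK>"] + [w for w, _ in counts.most_common(top_k)]
    return {w: i for i, w in enumerate(vocab)}

corpus = "the cat sat on the mat the cat".split()
word_to_index = build_vocab(corpus, top_k=3)

def lookup(word):
    # Rare / unseen words fall back to the unknown-token index.
    return word_to_index.get(word, word_to_index["<UNK>"])
```

Any word outside the top K (here, e.g. "mat") gets the `<UNK>` index, so the embedding matrix only needs K + 1 rows.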

Note that if you transfer pretrained weights into an embedding layer, the rest of the model architecture must be compatible with that embedding dimension.

He didn’t pull the feature values out of a hat: they are learned by training the embedding model. The things he has to choose are the vocabulary and the size of the embeddings (the number of features). There are also several choices of methods for training an embedding model. This is covered in Week 2, so maybe after reading Balaji’s response and my response, you should just continue on and watch the rest of the lectures in Week 2 to hear what Prof Ng explains.

Yes, the embeddings are the weight values. That is the “output” of the model: you index that matrix with the index (or one hot vector) corresponding to the word and it returns the 300 x 1 vector of the embedding for that word.
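A quick sketch of that equivalence, with toy sizes in Prof Ng's (embedding dimension, vocabulary size) layout, showing that multiplying E by a one-hot vector is the same as directly indexing a column:

```python
import numpy as np

# Toy embedding matrix: (embedding_dim, vocab_size) layout.
embedding_dim, vocab_size = 5, 8
E = np.random.randn(embedding_dim, vocab_size)

# One-hot vector selecting word index 3.
o = np.zeros(vocab_size)
o[3] = 1.0

# E @ o picks out column 3 -- mathematically identical to
# (but far slower than) direct indexing, which is why real
# implementations use a lookup rather than a matrix multiply.
via_matmul = E @ o
via_index = E[:, 3]
assert np.allclose(via_matmul, via_index)
```

With embedding_dim = 300 and vocab_size = 10,000, this is exactly the 300 x 1 vector described above.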

There are a number of pretrained word embedding models that are commonly used and Prof Ng will discuss those also in the rest of Week 2. In some cases, e.g. GloVe, they have trained several different sizes of embeddings and you can choose which one to use in a particular application.