W2 "Neural Language Model" slide missing diagram

For my fellow visual learners out there, I’d like to offer a clarification to the DLS C5 W2 lecture “Learning Word Embeddings,” slide “Neural language model.” The slide seems to be missing a diagram, which makes it difficult to see how the procedure can be used to learn embeddings, as the slide claims it does. Prof Ng says (via transcript):

So, the parameters of this model will be this matrix E, and use the same matrix E for all the words. So you don’t have different matrices for different positions in the preceding four words; it’s the same matrix E. And then, these weights are also parameters of the algorithm and you can use backprop to perform gradient descent to maximize the likelihood of your training set to just repeatedly predict, given four words in a sequence, what is the next word in your text corpus? And it turns out that this algorithm will learn pretty decent word embeddings.

However, the diagram does not really show this and seems to show something else. It shows the one-hot vectors being multiplied by the E matrix to produce an input of dimension 1800 (or 1200 for a four-word window), which is fed to a fully connected layer and then to a softmax layer. It is then noted that gradient descent is used to learn the weights and biases, shown as W[1], b[1], W[2], and b[2]. However, the E matrix is shown only as creating the input, and as we know from the first several courses, inputs are immutable; we learn weights, not data.
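For concreteness, here is how I picture the forward pass the slide describes, as a minimal numpy sketch. The dimensions (10,000-word vocabulary, 300-dim embeddings, 4-word window, so 4 × 300 = 1200 inputs) come from the lecture; the variable names, the hidden-layer size, and the choice of ReLU are my own assumptions, since the slide does not specify them:

```python
import numpy as np

# Dimensions from the slide; hidden size and ReLU are assumptions of mine.
vocab_size, emb_dim, context = 10_000, 300, 4
hidden = 128

rng = np.random.default_rng(0)
E = rng.normal(0, 0.01, (emb_dim, vocab_size))        # embedding matrix
W1 = rng.normal(0, 0.01, (hidden, context * emb_dim)) # FC layer weights
b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.01, (vocab_size, hidden))        # softmax layer weights
b2 = np.zeros(vocab_size)

def forward(word_ids):
    """word_ids: indices of the 4 preceding words."""
    # Multiplying E by a one-hot vector just selects a column of E,
    # so the one-hot product is implemented here as a column lookup.
    x = E[:, word_ids].reshape(-1, order="F")  # concatenate -> 1200-dim input
    a1 = np.maximum(0, W1 @ x + b1)            # fully connected layer (ReLU)
    z2 = W2 @ a1 + b2
    z2 -= z2.max()                             # for numerical stability
    p = np.exp(z2) / np.exp(z2).sum()          # softmax over the vocabulary
    return p

probs = forward(np.array([12, 45, 7, 901]))    # distribution over next word
```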

What Prof Ng says verbally is that we can actually treat the E matrix itself as a set of learnable weights. There are a few issues with this:

  • No diagram of this construction is shown, which, as noted above, makes it harder to understand.
  • No activation function is specified. (Could it even be linear?)
  • It is not clear how we map from the 1200-dimensional (300×4) learned quantity back to a 300×1 embedding for each of the 10,000 words. How is this done?
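On the last point, here is my best guess at the parameterization, sketched in numpy under the slide’s dimensions (this is my reading, not something stated on the slide): E itself is 300 × 10,000, so it carries 3,000,000 learnable parameters, the 1200 figure is just the four concatenated columns that form one training input, and each word’s embedding is simply its column of E. During backprop, the gradient on the input flows into exactly those four columns:

```python
import numpy as np

# E has one 300-dim column per vocabulary word: 300 x 10,000 = 3,000,000
# parameters in total (my reading; the slide only gives the dimensions).
vocab_size, emb_dim = 10_000, 300
E = np.random.default_rng(1).normal(0, 0.01, (emb_dim, vocab_size))
assert E.size == 3_000_000

def embedding(word_id):
    # A word's learned embedding is simply its column of E.
    return E[:, word_id]                     # shape (300,)

context_ids = [12, 45, 7, 901]
x = np.concatenate([embedding(i) for i in context_ids])   # shape (1200,)

# During backprop, the gradient w.r.t. x flows back into exactly these four
# columns of E, so gradient descent updates E along with the W's and b's.
lr = 0.01
dx = np.ones(4 * emb_dim)                    # placeholder upstream gradient
for k, i in enumerate(context_ids):
    E[:, i] -= lr * dx[k * emb_dim:(k + 1) * emb_dim]
```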

Anyway, I wanted to open this discussion item to note the missing details and the discrepancy, in case it helps anyone else, or in case course staff or others can add more details, including answers to the questions above.

Thanks for your message, I’ll notify the course staff.