Week 2: Question about parameters

Hello. I am currently on Week 2 of Sequence Models, and I have found myself unable to quite make sense of one of the concepts Professor Ng keeps mentioning.

For example, in the “Learning Word Embeddings” video, Professor Ng explains that the way to learn word embeddings is as follows:
we take the one-hot column vector of the “context” word c, multiply it by the embedding matrix E, and then pass the result to a softmax unit with 10,000 outputs. We then treat the softmax weights and the matrix E as parameters and train the model so that the softmax predicts the “target” word t.

The matrix E thus obtained gives us the desired embeddings.

To make it even more precise, refer to p. 20 of the course notes for Week 2. The diagram there reads as
o_c → E → e_c = E·o_c → softmax → \hat{y}.
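
To make sure I understand the pipeline, here is a minimal NumPy sketch of that forward pass. The vocabulary size of 10,000 is from the lecture; the embedding dimension of 300, the bias term, and the random values are just my assumptions for illustration:

```python
import numpy as np

vocab_size, emb_dim = 10_000, 300   # 10,000 from the lecture; 300 is an assumed embedding size

E = np.random.randn(emb_dim, vocab_size) * 0.01   # embedding matrix E (learned parameters)
W = np.random.randn(vocab_size, emb_dim) * 0.01   # softmax-unit weights (learned parameters)
b = np.zeros(vocab_size)                          # softmax-unit bias

def softmax(z):
    exp_z = np.exp(z - z.max())    # shift by the max for numerical stability
    return exp_z / exp_z.sum()

c = 4523                           # index of the "context" word (arbitrary example)
o_c = np.zeros(vocab_size)
o_c[c] = 1.0                       # one-hot column vector o_c

e_c = E @ o_c                      # e_c = E·o_c, the embedding of the context word
y_hat = softmax(W @ e_c + b)       # probabilities over the 10,000 possible "target" words

print(y_hat.shape, y_hat.sum())    # (10000,) and ~1.0
```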

Now, here is my question. As far as I remember, a softmax unit is a linear transform followed by the softmax function (the elements of the linear transform are the weights of the unit), and we have another linear transform E right before it. As we all remember from the early courses of the Deep Learning Specialization, the composition of linear transforms is still linear. My question is therefore: won't the weights of the softmax “absorb” the matrix E, so that we will be unable to reliably learn it? (Similarly to how all the layers of a deep neural network would collapse into a single linear transformation if they had no activation functions between them.)
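
To make my concern concrete, here is a small NumPy check (the matrix sizes are made up, just for the check) showing that, ignoring the bias, the two linear maps do compose into a single matrix W @ E:

```python
import numpy as np

vocab_size, emb_dim = 50, 8     # tiny sizes, just for the check

E = np.random.randn(emb_dim, vocab_size)   # "embedding" matrix
W = np.random.randn(vocab_size, emb_dim)   # linear part of the softmax unit

o = np.zeros(vocab_size)
o[7] = 1.0                      # one-hot input

two_steps = W @ (E @ o)         # embedding matrix, then the softmax unit's linear part
one_step = (W @ E) @ o          # the same computation as a single linear map

print(np.allclose(two_steps, one_step))   # True: the composition is itself linear
```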

Hi Nailbiter,

Sorry for the late reply. Let me share my thoughts with you about why the E matrix and the softmax are needed in this architecture.

  1. Thanks to the transformation using the E matrix, similar words end up with similar representations (close to each other) in the new embedding space.
  2. The best interpretation of the softmax formula is that it converts all the output values into a probability distribution (see the sketch right after this list).
    And as far as I know, the purpose of activation functions is to introduce non-linearities so that the NN can learn more complex features.
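
Here is a tiny sketch of point 2, i.e. what the softmax does to a vector of raw output values (the example numbers are arbitrary):

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - z.max())   # shift by the max for numerical stability
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1, -1.5])   # raw output values
probs = softmax(logits)

print(probs)        # approximately [0.65, 0.24, 0.10, 0.02], all non-negative
print(probs.sum())  # 1.0, so the outputs form a valid probability distribution
```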

Sharing with you some resources to read:

Hope this helps you better understand the reasoning behind it.

Best,

Rosa
