Week 2: Question about parameters

nailbiter · August 22, 2021, 1:08pm

Hello. I am currently on Week 2 of Sequence Models, and I can't quite make sense of one of the concepts Professor Ng keeps mentioning.

For example, in the “Learning Word Embeddings” video, the Professor says that the way to learn word embeddings is as follows:
we take the one-hot encoding column vector of the “context” word c, multiply it by the encoding matrix E, and then pass the result to a softmax unit with 10,000 outputs. We then treat the weights of the softmax and the matrix E as parameters and train the model so that the softmax outputs the “target” word t.

The matrix E thus obtained will give us the desired encoding.

To be more precise, refer to p. 20 of the course notes for Week 2. The diagram there reads:
o_c → E → e_c = E*o_c → o_{softmax} → \hat{y}.
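The pipeline in that diagram can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy sizes (6 words, 3 dimensions), the names `W` and `b` for the softmax unit's weights and bias, and the random initial values are not from the course notes.

```python
import math
import random

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

random.seed(0)
vocab_size, embed_dim = 6, 3  # toy sizes; the lecture uses 10,000 words

# E has one column per vocabulary word: shape (embed_dim, vocab_size).
E = [[random.gauss(0, 1) for _ in range(vocab_size)] for _ in range(embed_dim)]
# Assumed softmax-unit parameters: W is (vocab_size, embed_dim), b is (vocab_size,).
W = [[random.gauss(0, 1) for _ in range(embed_dim)] for _ in range(vocab_size)]
b = [0.0] * vocab_size

c = 4  # index of the context word
o_c = [1.0 if i == c else 0.0 for i in range(vocab_size)]  # one-hot vector

e_c = matvec(E, o_c)  # multiplying by a one-hot vector picks out column c of E
logits = [z + b_i for z, b_i in zip(matvec(W, e_c), b)]
y_hat = softmax(logits)  # predicted distribution over target words
```

Here `e_c` is just column `c` of `E`, and `y_hat` has one probability per vocabulary word.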

Now, here is my question. As far as I remember, a softmax unit is a linear transform followed by the softmax function (the entries of the linear transform are the weights of the unit). And we have another linear transform, E, before it. As we all remember from the early courses in the Deep Learning Specialization, the composition of linear transforms is still linear. My question is therefore: won't the weights of the softmax “absorb” the matrix E, so that we will be unable to learn it reliably? (Similarly to how all layers of a deep neural network would collapse into a single linear transformation if they had no activation functions following them.)
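One way to make this concern concrete is a toy sketch (not something from the course notes). The logits depend only on the product of the softmax weights and E, so for any invertible matrix A, replacing E by A·E and the softmax weights by W·A⁻¹ leaves every prediction unchanged. The names `W`, `A`, `A_inv` and all the integer values below are hypothetical, chosen so the arithmetic is exact.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

# Toy sizes: vocabulary of 4 words, embedding dimension 2 (hypothetical values).
E = [[1, 0, 2, -1],
     [0, 3, 1, 2]]   # shape (2, 4): one column per word
W = [[1, 2],
     [0, 1],
     [-1, 1],
     [2, 0]]         # shape (4, 2): softmax weights

# An arbitrary invertible 2x2 matrix and its inverse.
A = [[2, 1], [1, 1]]
A_inv = [[1, -1], [-1, 2]]

E_alt = matmul(A, E)      # a different "embedding" matrix
W_alt = matmul(W, A_inv)  # compensating softmax weights

# Both parameter pairs produce identical logits for every one-hot input,
# because W_alt @ E_alt = W @ A_inv @ A @ E = W @ E.
same_logits = matmul(W_alt, E_alt) == matmul(W, E)
print(same_logits)
```

So E is determined only up to an invertible linear transformation. The product still has to pass through the small embedding dimension, though, so its rank is at most that dimension, and the two matrices cannot merge into one unconstrained 10,000 × 10,000 transform.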

arosacastillo · December 2, 2021, 12:22pm

Hi Nailbiter,

Sorry for the late reply. Let me share my thoughts on why the E matrix and softmax are needed in this architecture.

Thanks to the transformation by the E matrix, similar words end up with similar vectors in the new space.
The best interpretation of the softmax formula is that it converts the output values into a probability distribution.
And as far as I know, the purpose of activation functions is to introduce non-linearities so that the NN can learn more complex features.
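The first point, that similar words end up close together in the embedding space, is usually checked with cosine similarity. Below is a minimal sketch. The three-dimensional vectors are made up for illustration and are not learned embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional embeddings, for illustration only.
embeddings = {
    "apple": [0.9, 0.1, 0.3],
    "orange": [0.8, 0.2, 0.4],
    "king": [-0.2, 0.9, -0.5],
}

sim_fruit = cosine_similarity(embeddings["apple"], embeddings["orange"])
sim_mixed = cosine_similarity(embeddings["apple"], embeddings["king"])
```

With these toy values, `sim_fruit` is larger than `sim_mixed`, which is the pattern one would hope to see between related and unrelated words in a trained E.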

Sharing with you some resources to read:

Hope this helps you better understand the reasoning behind it.

Best,

Rosa
