Hello. I am currently on Week 2 of Sequence Models, and I can't quite make sense of one of the concepts Professor Ng keeps coming back to.

For example, in the “Learning Word Embeddings” video, Professor Ng explains that the way to learn word embeddings is as follows:

we take the one-hot column vector of the “context” word c, multiply it by the embedding matrix E, and then pass the result to a softmax unit with 10,000 outputs. We then treat the softmax weights and the matrix E as parameters and train the model so that the softmax predicts the “target” word t.

The matrix E obtained this way gives us the desired embeddings.

To make this more precise, see p. 20 of the course notes for Week 2. The diagram there reads

o_c → E → e_c = E*o_c → o_{softmax} → \hat{y}.
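To make sure I'm reading the diagram correctly, here is a minimal numpy sketch of that forward pass. The sizes are toy ones (vocabulary of 5 instead of 10,000), and the names `E`, `W`, `b`, `o_c` are just my labels for the embedding matrix, the softmax unit's weights and bias, and the one-hot input:

```python
import numpy as np

V, D = 5, 3  # toy vocabulary size and embedding dimension (the course uses V = 10,000)
rng = np.random.default_rng(0)

E = rng.normal(size=(D, V))  # embedding matrix: column j is word j's embedding
W = rng.normal(size=(V, D))  # softmax unit's linear-transform weights
b = np.zeros(V)              # softmax unit's bias

c = 2                        # index of the context word
o_c = np.zeros(V)
o_c[c] = 1.0                 # one-hot column vector for the context word

e_c = E @ o_c                # multiplying by a one-hot vector selects column c of E
z = W @ e_c + b              # linear transform inside the softmax unit
y_hat = np.exp(z) / np.exp(z).sum()  # softmax: a distribution over the vocabulary
```

So `y_hat` is the predicted distribution over target words, and `e_c` really is just the c-th column of E.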

Now, here is my question. As far as I remember, a softmax unit is a linear transform followed by the softmax function (the entries of the linear transform are the unit's weights). And we have one more linear transform, E, right before it. As we all remember from the early courses in the Deep Learning specialization, the composition of linear transforms is still linear. My question is therefore: won't the softmax weights “absorb” the matrix E, so that we will be unable to reliably learn it? (Similar to how all the layers of a deep neural network would collapse into a single linear transformation if they had no activation functions between them.)
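To make the worry concrete: by associativity of matrix multiplication, the softmax weights W and the embedding matrix E compose into a single matrix M = W·E, which is exactly the “absorption” I mean. A quick numpy check (same toy sizes and hypothetical names as above):

```python
import numpy as np

V, D = 5, 3
rng = np.random.default_rng(1)
E = rng.normal(size=(D, V))  # embedding matrix
W = rng.normal(size=(V, D))  # softmax unit's weights
o_c = np.zeros(V)
o_c[0] = 1.0                 # one-hot context word

# The two linear maps compose into one V x V matrix...
M = W @ E

# ...and applying them in sequence gives the same pre-softmax logits
z_two_steps = W @ (E @ o_c)
z_one_step = M @ o_c
```

The logits (and hence the softmax output) are identical either way, so as far as the output is concerned, only the product W·E seems to matter, not W and E separately.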