How should I understand this statement about word embeddings?

When learning word embeddings, we create an artificial task of estimating P(target∣context). It is okay if we do poorly on this artificial prediction task; the more important by-product of this task is that we learn a useful set of word embeddings.

When building a natural language model, it's important to have good embeddings. If the goal is only to learn how the embedding-creation process works, it doesn't matter much, I guess.

Can you give us a reference to where Prof Ng makes that statement (i.e. which lecture and the time offset would also be helpful)? I tried searching the transcripts of several of the lectures in C5 W2 and couldn’t find that statement, although it does sound familiar. I’d like to listen to all that he says about it and hope to be able to offer some interpretation.

Without hearing the lecture again, my interpretation would be that we need a metric or cost function to train a model, so for training a word embedding the conditional probability that he shows there is a common choice. But there will undoubtedly be cases in which there could be a lot of words that would make sense or could occur in a particular position in a given sentence. Or to put it in the same terms of the probability expression: with some contexts, there could be many possible “correct” answers. So in other words, the probability of correct prediction in a case like that is not too high. But by trying to maximize it, we get useful training even if the maximum value we can achieve is not very high. Of course then the next question is how you can quantify whether the word embeddings you learn by that training process actually are useful. I’m sure Prof Ng also addresses that point and am hoping it will be clarified by listening to the relevant lecture again.
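To make the "many possible correct answers" point concrete, here is a minimal sketch. The toy data is hypothetical and not taken from the lecture: it assumes that a single context was seen four times in the training text, each time followed by a different target word. The model that minimizes the cross-entropy loss for that context simply predicts the observed frequencies, so the best it can do is assign 0.25 to each word, and the loss can never drop to zero.

```python
import math
from collections import Counter

def best_achievable(targets):
    """For one fixed context, return the loss-minimizing prediction.

    targets: the target words observed after that context.
    Returns (probs, loss), where probs is the empirical distribution
    (the minimizer of average cross-entropy) and loss is the resulting
    average cross-entropy in nats (the entropy of that distribution).
    """
    counts = Counter(targets)
    total = len(targets)
    probs = {word: count / total for word, count in counts.items()}
    loss = -sum(p * math.log(p) for p in probs.values())
    return probs, loss

# Hypothetical toy data: four equally plausible targets for one context,
# e.g. the words that followed "a glass of" in an imagined corpus.
observed_targets = ["juice", "water", "milk", "wine"]

probs, min_loss = best_achievable(observed_targets)
print(probs)     # each word gets probability 0.25
print(min_loss)  # equals log(4): the loss cannot go lower for this context
```

So even a perfectly trained model looks "poor" on the prediction task here, yet pushing towards that optimum is what forces words that appear in similar contexts to end up with similar embeddings.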

I don't know whether I'm allowed to share the source of this statement. In any case, there is no other background or context, only this one sentence.

But by trying to maximize it, we get useful training even if the maximum value we can achieve is not very high.

Anyway, your explanation makes sense. Thank you very much.
