Andrew mentioned that the “Supervised Learning” task won’t produce a good output in general, but that the goal is to produce a good embedding representation, which will happen eventually.
I totally don’t get why the “Supervised Learning” objective won’t be good.
The skip-gram model is a kind of supervised classification model. Taking the same example from the lecture, some of the training pairs (x, y) could be:
(“orange”, “juice”)
(“orange”, “glass”)
(“orange”, “my”)
…
Given the input word “orange”, we do not expect the model to predict one word correctly (nor can I, can you?), because the same input word (“orange” in these training examples) has different outputs. Instead, we hope the model learns “something” about how the input word relates to the output, i.e., an embedding vector (or you could say a feature vector) for the input word. A toy sketch of this is below.
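Here is a minimal sketch of that idea (not the course’s exact setup): the vocabulary, training pairs, dimensions, and learning rate are all made up for illustration. Because “orange” appears with several different targets, the softmax classifier can never predict a single word with high confidence, yet the embedding row for “orange” still gets pushed toward the words it co-occurs with, and that embedding matrix is what we keep.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up vocabulary and (context, target) pairs for illustration only
vocab = ["orange", "juice", "glass", "my", "apple", "drink"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 4                     # vocabulary size, embedding dimension

pairs = [("orange", "juice"), ("orange", "glass"), ("orange", "my"),
         ("apple", "juice"), ("apple", "drink")]

E = rng.normal(scale=0.1, size=(V, D))   # embedding matrix (what we actually want)
W = rng.normal(scale=0.1, size=(D, V))   # softmax weights (discarded after training)

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

lr = 0.5
for epoch in range(200):
    for ctx, tgt in pairs:
        c, t = word_to_idx[ctx], word_to_idx[tgt]
        e = E[c]                          # embedding of the context word
        p = softmax(e @ W)                # predicted distribution over the vocabulary
        dz = p.copy()                     # cross-entropy gradient: p - one_hot(target)
        dz[t] -= 1.0
        W -= lr * np.outer(e, dz)         # update softmax weights
        E[c] -= lr * (W @ dz)             # update the embedding itself

# The classifier can't be "good": P(juice|orange), P(glass|orange), P(my|orange)
# compete with each other and all stay well below 1.
print(softmax(E[word_to_idx["orange"]] @ W).round(2))
```

Running this, the predicted probabilities for “orange” spread across “juice”, “glass”, and “my” rather than concentrating on one word, which is exactly why the classification accuracy is not the point; the learned rows of `E` are.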
I see. Thanks @edwardyu!