This is theory-related and has nothing to do with code, but I was going through the coursework and had a question I could really use help with.
With Word2Vec models, this is my understanding:
- We could randomly do skip-grams, where we pick a single target and a single context word (a sketch of both sampling schemes follows this list).
- We take a positive target plus k negative targets sampled at random.
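To make the two sampling schemes concrete, here is a minimal sketch of how a single training example might be drawn. The sentence, window size, and function names are my own assumptions, not from the course:

```python
import random

sentence = "i want a glass of orange juice to go along with my cereal".split()
window = 4  # assumed half-width of the context window

def sample_skipgram_pair(tokens, window):
    # Pick a random context position, then a random target within +/- window of it.
    c = random.randrange(len(tokens))
    lo, hi = max(0, c - window), min(len(tokens) - 1, c + window)
    t = c
    while t == c:
        t = random.randint(lo, hi)
    return tokens[c], tokens[t]

context, target = sample_skipgram_pair(sentence, window)

# For the second scheme (negative sampling), the same positive (context, target)
# pair is kept and k extra "negative" targets are drawn from the vocabulary.
k = 4
vocabulary = list(set(sentence))              # toy vocabulary, for illustration only
negatives = random.sample([w for w in vocabulary if w != target], k)
```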
No matter which method we use, the first step is encoding the context word using the embedding matrix E.
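For that encoding step, a minimal sketch (I'm assuming a 10,000-word vocabulary, 300-dimensional embeddings, and a row-per-word layout for E; the shapes and orientation are just conventions):

```python
import numpy as np

vocab_size, embed_dim = 10000, 300                  # assumed sizes
E = np.random.randn(vocab_size, embed_dim) * 0.01   # embedding matrix, one row per word

def embed(word_index):
    # Multiplying a one-hot vector into E just selects one row, so we index directly.
    return E[word_index]

e_c = embed(4523)   # 300-dimensional embedding of the context word (index is arbitrary)
```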
Professor Ng says that in the first method, we build a tree of classifiers (a hierarchical softmax) to find the correct target, and the tree doesn't have to be balanced.
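Just to spell out what the tree is speeding up, here is a rough sketch of the plain skip-gram softmax over the whole vocabulary, continuing the toy variables from the sketch above (names are mine); the hierarchical softmax exists precisely because this denominator sums over all |V| words:

```python
Theta = np.random.randn(vocab_size, embed_dim) * 0.01  # one output parameter vector per vocabulary word

def skipgram_softmax_loss(e_c, target_index):
    logits = Theta @ e_c                            # one score for every word in the vocabulary
    logits = logits - logits.max()                  # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax: P(target = t | context) for all t
    return -np.log(probs[target_index])             # cross-entropy loss for the observed target

loss = skipgram_softmax_loss(e_c, target_index=1234)
```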
The second method is much simpler: it is effectively a single layer of logistic regression units, one per word in the vocabulary, and we train a loss on the sigmoid outputs (only the k+1 sampled units per example).
Training with k negative samples makes the model better, since each example gives it both a positive and a negative relation, producing a more “complete” representation/embedding.
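And a corresponding sketch of the negative-sampling loss, again continuing the toy variables above. The uniform negative sampler here is a simplification of the frequency-based heuristic described in the course:

```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(e_c, target_index, negative_indices):
    # One sigmoid/logistic unit per vocabulary word, but only k+1 of them are
    # trained on this example: the true target (label 1) and k negatives (label 0).
    pos = sigmoid(Theta[target_index] @ e_c)        # pushed toward 1
    neg = sigmoid(-Theta[negative_indices] @ e_c)   # sigmoid(theta . e_c) pushed toward 0
    return -np.log(pos) - np.log(neg).sum()

k = 5
negative_indices = np.random.randint(0, vocab_size, size=k)  # toy uniform sampler
loss = negative_sampling_loss(e_c, target_index=1234, negative_indices=negative_indices)
```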
My question is this: the single logistic regression layer is clearly simpler to execute, so why can't we do skip-grams followed by logistic regression?
Is there some specific reason this cannot be done that I am missing, or is Professor Ng merely describing the models as defined in the literature?