Why can't skip-grams use logistic regression?

This is theory-related and has nothing to do with code, but I was going through the coursework and had a question that I could really use help with.

With Word2Vec models, this is my understanding:

  1. We can do skip-grams, where we randomly pick a single context word and a single target word within a window.

  2. We take one positive target plus k negative targets sampled at random (negative sampling).

No matter which method we use, the first step is encoding the context word using the embedding matrix E.
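
(Just to make sure I have the notation right, here is a tiny numpy sketch of what I mean by "encoding the context with E". The sizes and the word index are made up.)

```python
import numpy as np

# Made-up sizes, just for illustration
vocab_size, embed_dim = 10_000, 300
rng = np.random.default_rng(0)

# E is the embedding matrix; each column is one word's embedding
E = rng.normal(scale=0.01, size=(embed_dim, vocab_size))

# One-hot vector o_c for the context word at a made-up index c
c = 4257
o_c = np.zeros(vocab_size)
o_c[c] = 1.0

# e_c = E @ o_c is just column c of E, so in practice it is a cheap lookup
e_c = E @ o_c
assert np.allclose(e_c, E[:, c])
```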

Professor Ng says that in the first method, we build a hierarchical softmax: a tree of binary classifiers (the tree doesn't have to be balanced) that narrows down to the correct target.
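
Here is a toy sketch of how I picture that tree (my own construction with a made-up 4-word vocabulary and random node weights, not the course code): each internal node is a binary classifier, and the probability of a target word is the product of the branch probabilities along its path.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy hierarchical softmax over a 4-word vocabulary (illustrative only).
# Three internal nodes, each with its own binary classifier.
embed_dim = 8
rng = np.random.default_rng(0)
node_weights = rng.normal(size=(3, embed_dim))  # one weight vector per internal node

# Path from the root to each word: (node index, branch), branch = +1 left, -1 right
paths = {
    0: [(0, +1), (1, +1)],
    1: [(0, +1), (1, -1)],
    2: [(0, -1), (2, +1)],
    3: [(0, -1), (2, -1)],
}

def p_target_given_context(target, e_c):
    """P(target | context) as a product of binary decisions along the tree path."""
    p = 1.0
    for node, branch in paths[target]:
        p *= sigmoid(branch * (node_weights[node] @ e_c))
    return p

e_c = rng.normal(size=embed_dim)  # pretend this is the context word's embedding
probs = [p_target_given_context(w, e_c) for w in range(4)]
print(probs, sum(probs))          # the four probabilities sum to 1
```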

The second method is much simpler: it uses a single logistic-regression layer with as many nodes as the vocabulary, and we train a loss based on the sigmoid outputs.

Training with k negative samples makes the model better, since we are giving it both positive and negative relations, producing a more "complete" representation/embedding.
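
For concreteness, here is a minimal numpy sketch of one negative-sampling update as I understand it (the indices, sizes, and learning rate are made up, and a real implementation would sample negatives from a smoothed unigram distribution while looping over a corpus):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

vocab_size, embed_dim, k, lr = 10_000, 300, 4, 0.025
rng = np.random.default_rng(1)
E = rng.normal(scale=0.01, size=(embed_dim, vocab_size))      # context embeddings
Theta = rng.normal(scale=0.01, size=(embed_dim, vocab_size))  # per-word classifier weights

context = 4257                                # made-up context word index
pos_target = 812                              # a word actually seen near the context
neg_targets = rng.integers(0, vocab_size, k)  # k randomly sampled "negative" words

e_c = E[:, context]
targets = np.concatenate(([pos_target], neg_targets))
labels = np.array([1.0] + [0.0] * k)          # 1 for the true pair, 0 for the noise pairs

# k + 1 independent logistic regressions: P(label = 1 | c, t) = sigmoid(theta_t . e_c)
preds = sigmoid(Theta[:, targets].T @ e_c)
loss = -np.sum(labels * np.log(preds) + (1 - labels) * np.log(1 - preds))
print("loss:", loss)

# One gradient step; only the k + 1 sampled columns of Theta (plus e_c) are updated
err = preds - labels                          # shape (k + 1,)
grad_e = Theta[:, targets] @ err              # gradient w.r.t. the context embedding
Theta[:, targets] -= lr * np.outer(e_c, err)  # gradient w.r.t. the sampled classifiers
E[:, context] -= lr * grad_e
```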

My question is this: the single logistic-regression layer is clearly simpler to execute.

So why can't we do skip-grams followed by logistic regression?

Is there some specific reason this cannot be done that I am missing? Or is Professor Ng merely describing the models defined in the literature?

I suppose your question is: why not train the skip-gram model the same way as the negative-sampling model?
As you know, skip-gram picks a context word and then target words around it within a certain window size. That means we only have positively labelled data and no negative data. If we apply logistic regression, all output labels are 1 (there are no 0 labels), so the model won't learn anything.
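
To see it concretely, here is a toy sketch (made-up sizes): when every pair has label 1, the gradient of the logistic loss only ever pushes the score up, for any target word, so the classifier just learns to output 1 everywhere instead of learning which words actually co-occur.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
embed_dim, lr = 8, 0.5
e_c = rng.normal(size=embed_dim)    # some context embedding
theta = rng.normal(size=embed_dim)  # classifier weights for *any* target word

# Every (context, target) pair we observe has label 1, so the gradient of the
# logistic loss is always (sigmoid(score) - 1) * e_c: it only ever pushes the
# score up, regardless of which target word theta belongs to.
for _ in range(200):
    score = theta @ e_c
    theta -= lr * (sigmoid(score) - 1.0) * e_c

print(sigmoid(theta @ e_c))  # close to 1.0 for any target: the model predicts "1" everywhere
```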


Hi!

Thank you so much for your answer.

It makes sense now.

A logistic regression model, and the relevant softmax, are built upon there being both positive and negative samples.

For example: "I am 90% sure this is a dog, but there is a 10% chance it could be a cat (negative sample).

Skip-grams just have a single positive sample, and our job is to find encodings relevant to that positive sample.

So the architectures just don't match, and the right model there is a tree of binary classifiers (hierarchical softmax).