Week 2: How is Negative Sampling a bunch of logistic regression problems?

I’m not understanding how Negative Sampling is a bunch of logistic regression NNs. Do we get k+1 NNs per context word? So is the difference between them just the target word in the pair and the label? And what has the probability of 1 and 0 in the negative-pairing NNs?

TIA

Hello @Awong,

I may not be fully understanding your question, but let me put it this way:

First, we do not have “negative pairing NNs”. We don’t have “a bunch of NNs”, and we don’t have k+1 NNs. Instead, we have only one NN.

For each context word, we prepare one positive sample and k negative samples, and then we send all k + 1 samples to train that single NN. The NN needs to see both positive and negative samples so that it learns to distinguish between them.

Among these k + 1 samples, the context word is the same, but each sample has its own target word. Only the positive sample has a label of 1; the rest have a label of 0.

There are k+1 samples, and there is only one NN.
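
If it helps, here is a minimal sketch of how the k + 1 samples for one context word could be assembled. This is my own illustration rather than the course code, and the vocabulary size, k, and word indices are made-up assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 10_000   # assumed vocabulary size from the lecture
k = 4                 # assumed number of negative samples per context word

context_word = 1234   # index of the context word (made up for illustration)
true_target = 5678    # index of the word actually observed near the context word

# Positive pair: (context, true target) with label 1.
pairs = [(context_word, true_target, 1)]

# k negative pairs: (context, randomly sampled target) with label 0.
# Word2Vec actually samples from a smoothed unigram distribution and avoids
# the true target; uniform sampling keeps this sketch simple.
for _ in range(k):
    negative_target = int(rng.integers(vocab_size))
    pairs.append((context_word, negative_target, 0))

for c, t, label in pairs:
    print(f"context={c:5d}  target={t:5d}  label={label}")
```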

Cheers,
Raymond


Thanks @rmwkwok. Some followups.

Is the training set size just (k+1)*number of context words?

How do we decide which context words to use? Is it just a random sample of the corpus, or all unique words in the corpus?

What is Andrew referring to, then, when he says there are 10,000 or k+1 binary logistic regression classifiers? (starting at 6:45 in the video, and he makes more references throughout the rest of the video)

Hello @Awong,

Random sampling, and not the unique words: one word can appear in two different contexts.

And yes, the training set size is just (k + 1) × (number of context words).

Thanks for the time pointer.

First of all, it is still just one NN.

I think Andrew discussed the numbers to show how the reduction in computation is achieved. Let’s analyze the arguments he made (you can find them by searching the transcript below the video):

Instead of having one giant 10,000 way softmax, which is very expensive to compute

Our initial problem is a NN with an output layer of 10,000 nodes, followed by a softmax. This means that, without negative sampling, each context word requires computing all 10,000 outputs, plus the softmax normalization over all of them.
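
To make that cost concrete, here is a rough sketch with made-up sizes and random parameters (not the lecture’s notation): the softmax denominator alone has to sum over scores for all 10,000 vocabulary words.

```python
import numpy as np

vocab_size, embedding_dim = 10_000, 300   # made-up sizes
rng = np.random.default_rng(1)
Theta = rng.normal(size=(vocab_size, embedding_dim)) * 0.01  # output-layer parameters
e_c = rng.normal(size=embedding_dim)                         # context-word embedding

logits = Theta @ e_c                   # 10,000 dot products for one context word
probs = np.exp(logits - logits.max())  # subtract max for numerical stability
probs /= probs.sum()                   # the denominator sums over all 10,000 entries
print(probs.shape)                     # (10000,)
```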

we’ve instead turned it into 10,000 binary classification problems. Each of which is quite cheap to compute

Now, we change the way we think about it. We don’t use an output layer of 10,000 nodes; instead, we use an output layer of 1 node. The output layer’s activation function changes from softmax to sigmoid, and the problem changes from a multi-class classification problem to a binary classification problem.
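
Concretely, each (context, target) pair is scored by a single sigmoid unit, which is why each pair is just a logistic regression. A minimal sketch, assuming an embedding matrix E for context words and a parameter matrix Theta for target words (both names and sizes are my own assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

vocab_size, embedding_dim = 10_000, 300   # made-up sizes
rng = np.random.default_rng(2)
E = rng.normal(size=(vocab_size, embedding_dim)) * 0.01      # context-word embeddings
Theta = rng.normal(size=(vocab_size, embedding_dim)) * 0.01  # target-word parameters

def predict(context_idx, target_idx):
    """P(label = 1 | context, target): one logistic regression per pair."""
    e_c = E[context_idx]          # embedding of the context word
    theta_t = Theta[target_idx]   # parameters for this candidate target word
    return sigmoid(theta_t @ e_c)

print(predict(1234, 5678))  # probability that word 5678 is a real target for word 1234
```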

Still, before introducing negative sampling, we would feed all 10,000 pairs of samples to train this one binary classification network. Each pair is one problem, so we have 10,000 binary classification problems.

Note the difference between a network and a problem. Andrew said problem.

on every iteration, we’re only going to train … k+1 of them

Now, we introduce negative sampling, and we reduce it from 10,000 pairs (or 10,000 problems) to k + 1 pairs (or k + 1 problems), and we see the reduction in computation cost!

From 10,000 to k + 1!
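
Here is a sketch of what one training step could look like under this scheme (again my own illustration, not the lecture’s notation). The key point is that the loss and the gradient updates touch only the k + 1 sampled target rows, not all 10,000 output rows:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(E, Theta, context_idx, target_idxs, labels, lr=0.01):
    """One negative-sampling step for a single context word.

    target_idxs holds k + 1 word indices (1 positive + k negatives);
    labels is the matching array of 1s and 0s.
    """
    e_c = E[context_idx]              # (embedding_dim,)
    thetas = Theta[target_idxs]       # (k + 1, embedding_dim)
    preds = sigmoid(thetas @ e_c)     # (k + 1,) predicted probabilities
    errors = preds - labels           # gradient of the loss w.r.t. the logits

    # Only the k + 1 selected rows of Theta (and one row of E) get updated,
    # instead of all 10,000 output rows as in the full softmax.
    Theta[target_idxs] -= lr * errors[:, None] * e_c
    E[context_idx] -= lr * errors @ thetas

    # Binary cross-entropy over the k + 1 pairs (for monitoring only).
    eps = 1e-12
    return -np.mean(labels * np.log(preds + eps)
                    + (1 - labels) * np.log(1 - preds + eps))

# Illustrative usage with made-up sizes, indices, and k = 4.
vocab_size, embedding_dim = 10_000, 300
rng = np.random.default_rng(3)
E = rng.normal(size=(vocab_size, embedding_dim)) * 0.01
Theta = rng.normal(size=(vocab_size, embedding_dim)) * 0.01

target_idxs = np.array([5678, 42, 901, 7300, 11])   # 1 positive + 4 negatives
labels = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
print(train_step(E, Theta, context_idx=1234, target_idxs=target_idxs, labels=labels))
```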

To conclude, I think Andrew was presenting a line of thinking that starts from (i) the initial 10,000-way softmax problem (with one NN), moves to (ii) the 10,000 binary classification problems (with one NN), and finally arrives at (iii) the k + 1 binary classification problems (still with one NN).

Cheers,
Raymond


Thank you so much, it’s clear to me now. The shift in the problem statement was something I missed when I first watched the lecture.

You are welcome, @Awong! :raising_hands: