Hello @Awong,
Random sampling.
It’s not the unique words. One word can be in two different contexts.
Yes.
Thanks for the time pointer.
First of all, it is still just one NN.
I think Andrew discussed the numbers to show you how the reduction in computation is achieved. Let’s analyze the arguments he made (you can find them by searching the transcript below the video):
Instead of having one giant 10,000 way softmax, which is very expensive to compute
Our initial problem is an NN with an output layer of 10,000 nodes, followed by a softmax. This means that, without negative sampling, each context word involves computing all 10,000 outputs.
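To make that cost concrete, here is a minimal numpy sketch of the softmax step (the sizes and variable names are my own toy choices, not Andrew’s code): for every single context word, we must compute all 10,000 dot products just to normalize the softmax.

```python
import numpy as np

# Toy sizes for illustration only -- not from the lecture.
vocab_size = 10_000
embed_dim = 300

E = np.random.randn(vocab_size, embed_dim)      # word embeddings
Theta = np.random.randn(vocab_size, embed_dim)  # output weights, one row per vocabulary word

def full_softmax_probs(context_id):
    e_c = E[context_id]                   # embedding of the context word
    logits = Theta @ e_c                  # 10,000 dot products: the expensive part
    exps = np.exp(logits - logits.max())  # numerically stable softmax
    return exps / exps.sum()
```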
we’ve instead turned it into 10,000 binary classification problems. Each of which is quite cheap to compute
Now, we change the way we think about it. Instead of an output layer of 10,000 nodes, we use an output layer of 1 node. The output layer’s activation function changes from softmax to sigmoid, and the problem changes from a multi-class classification problem to a binary classification problem.
Still, before introducing negative sampling, we would be feeding all 10,000 pairs of samples to train this one binary classification network. Each pair is one problem, so we have 10,000 binary classification problems.
Note the difference between a network and a problem: Andrew said problems, not networks.
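Continuing the sketch above (again, my own illustrative code, not Andrew’s), each of these binary problems only needs one dot product and one sigmoid, but without negative sampling we would still have 10,000 of them per context word:

```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pair_probability(context_id, target_id):
    # One binary "problem": is (context, target) a real co-occurring pair?
    return sigmoid(Theta[target_id] @ E[context_id])  # a single dot product + sigmoid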
on every iteration, we’re only going to train … k+1 of them
Now, we introduce negative sampling, and we reduce it from 10,000 pairs (or 10,000 problems) to k + 1 pairs (or k + 1 problems), and we see the reduction in computation cost!
From 10,000 to k + 1!
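Here is how that selection of k + 1 pairs might look in the same toy sketch (I have simplified the negative-sampling distribution to uniform, whereas the lecture samples negatives from a frequency-based heuristic):

```python
k = 5  # number of negative samples per positive pair

def make_pairs(context_id, true_target_id):
    # 1 positive pair plus k randomly drawn negative pairs -> k + 1 problems.
    # (Simplified: uniform sampling here, not the lecture's frequency-based heuristic.)
    negatives = np.random.randint(0, vocab_size, size=k)
    return [(context_id, true_target_id, 1)] + [(context_id, int(n), 0) for n in negatives]

# On each iteration we evaluate only k + 1 sigmoids instead of a 10,000-way softmax.
for c, t, label in make_pairs(context_id=42, true_target_id=123):
    p = pair_probability(c, t)  # cheap: one dot product per pair
```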
To conclude, I think Andrew was walking us through a flow of thinking: starting from (i) the initial 10,000-way softmax problem (with one NN), to (ii) the 10,000 binary classification problems (still with one NN), and finally to (iii) the k + 1 binary classification problems.
Cheers,
Raymond