Negative sampling trains a network whose output layer has one binary classifier per vocabulary word, but each training iteration uses only k+1 of those words: the true target plus k randomly sampled ones. Why is it OK to do this, and how does it affect the loss function calculation?
This page has a better explanation:
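Separately, to make the loss calculation concrete, here is a rough sketch of how I understand it, not a definitive implementation. The loss is a sum of k+1 binary cross-entropy terms (label 1 for the true target, label 0 for each sampled negative), so only those k+1 output rows receive a gradient on that iteration. The names `negative_sampling_loss` and `output_gradients`, the sizes, and the uniform sampling of negatives are all my own assumptions for illustration; real word2vec implementations usually draw negatives from a smoothed unigram distribution instead.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center_vec, output_vecs, target_idx, negative_idx):
    """Sum of k+1 binary cross-entropy terms; the rest of the vocabulary is ignored."""
    pos_score = output_vecs[target_idx] @ center_vec
    neg_scores = output_vecs[negative_idx] @ center_vec
    # Label 1 for the true word: -log(sigmoid(s)).
    # Label 0 for each negative: -log(1 - sigmoid(s)) = -log(sigmoid(-s)).
    return -np.log(sigmoid(pos_score)) - np.sum(np.log(sigmoid(-neg_scores)))

def output_gradients(center_vec, output_vecs, target_idx, negative_idx):
    """Gradient of the loss w.r.t. the k+1 output rows that are actually used."""
    idx = np.concatenate(([target_idx], negative_idx))
    labels = np.zeros(len(idx))
    labels[0] = 1.0
    scores = output_vecs[idx] @ center_vec
    grads = (sigmoid(scores) - labels)[:, None] * center_vec[None, :]
    return idx, grads  # every other row of output_vecs has zero gradient

# Hypothetical sizes, chosen only for the example.
rng = np.random.default_rng(0)
vocab_size, dim, k = 10_000, 50, 5
center_vec = rng.normal(scale=0.1, size=dim)
output_vecs = rng.normal(scale=0.1, size=(vocab_size, dim))
target_idx = 42
# Assumption: uniform negatives; this does not exclude the target word itself.
negative_idx = rng.choice(vocab_size, size=k, replace=False)

loss = negative_sampling_loss(center_vec, output_vecs, target_idx, negative_idx)
idx, grads = output_gradients(center_vec, output_vecs, target_idx, negative_idx)
```

With 10,000 words in the vocabulary and k = 5, each step touches 6 output rows instead of 10,000, which is where the speed-up over a full softmax comes from. Because the negatives are resampled every iteration, the other classifiers still get trained over time.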