Hi folks, I've been reviewing Sequence Models (Course 5 of the Deep Learning Specialization) and I've been struggling to get a grasp of the negative sampling technique.
The image is captured at 8:06 in the video "Negative Sampling" (Learning Word Embeddings: Word2vec & GloVe).
As shown in the attached image, each node represents a binary classifier using a sigmoid activation.
As far as I understand, each binary classifier is independent of the others and has its own loss function; I assume that loss is binary cross-entropy. So my question is: what is the overall loss function that is finally optimized to learn the embedding matrix E and the weights of the classifiers?
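To make my assumption concrete, here is my best guess at the per-example loss, using the lecture's notation ($e_c$ for the embedding of the context word $c$, i.e. a column of $E$, and $\theta_t$ for the weight vector of the classifier for target word $t$): simply the sum of the $k+1$ binary cross-entropy terms, one for the positive pair and one for each of the $k$ negative samples $t_1, \dots, t_k$:

$$
\mathcal{L}(c, t) = -\log \sigma\left(\theta_t^{\top} e_c\right) - \sum_{i=1}^{k} \log\left(1 - \sigma\left(\theta_{t_i}^{\top} e_c\right)\right)
$$

and then the overall objective would be this summed (or averaged) over all sampled (context, target) pairs from the corpus. Is that right, or is there a single combined objective I'm missing?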