Hi folks, I’ve been reviewing the Sequence Models course (Course 5 of DLS) and have been struggling to get a grasp on the negative sampling technique.

The attached image is captured at 8:06 in the video “Negative Sampling” from the section Learning Word Embeddings: Word2vec & GloVe.

As shown in the attached image, each node represents a binary classifier using the Sigmoid activation.

As far as I understand, each binary classifier is independent of the others and has its own loss function; I assume that loss is binary cross-entropy. So my question is: what is the overall loss function that is actually optimized to learn the embedding matrix E and the weights of the classifiers?
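To state my current understanding precisely, here is a minimal NumPy sketch of what I believe the loss for a single training example looks like: the sum of k+1 independent binary cross-entropy terms, one for the true (context, target) pair with label 1 and one for each of the k sampled negative words with label 0. The function and variable names below are my own, not from the course:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(e_c, theta_pos, theta_negs):
    """My understanding of the loss for one training example.

    e_c        : (d,)   embedding of the context word (a column of E)
    theta_pos  : (d,)   classifier weights for the true target word
    theta_negs : (k, d) classifier weights for the k negative samples
    """
    # Positive pair, label 1: cross-entropy term -log(sigmoid(theta^T e_c))
    pos_loss = -np.log(sigmoid(theta_pos @ e_c))
    # Negative pairs, label 0: cross-entropy terms -log(1 - sigmoid(theta^T e_c))
    neg_loss = -np.sum(np.log(1.0 - sigmoid(theta_negs @ e_c)))
    return pos_loss + neg_loss
```

Is the overall objective then simply this quantity summed (or averaged) over all sampled training examples in the corpus, with gradients flowing into both E and the classifier weights?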