In the video “Backpropagation through time” (see last slide) Andrew uses the cost function [-y . log yhat - (1-y) . log(1-yhat)].
But in the video “Language model and sequence generation” (see last slide) he uses the cost function [-SUM( y . log yhat) ].
Is there an error in that last video, or am I missing something?
It looks like in the first case the output is from a sigmoid activation function (binary classification), whereas in the second case it's a softmax output (multiclass classification).
The loss functions are really the same: both are cross-entropy, and the binary formula is just the two-class special case.
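For anyone reading this later, here's a minimal NumPy sketch (the variable names are my own, not from the course slides) that checks the two formulas give the same number for a single example:

```python
import numpy as np

y = 1         # true label (0 or 1)
y_hat = 0.8   # sigmoid output, interpreted as P(y = 1)

# Binary (sigmoid) form: -y*log(yhat) - (1-y)*log(1-yhat)
binary_loss = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# Softmax form: treat the label as a one-hot vector and the prediction
# as a 2-element distribution [P(y=0), P(y=1)]
y_onehot = np.array([1 - y, y])
y_hat_vec = np.array([1 - y_hat, y_hat])
softmax_loss = -np.sum(y_onehot * np.log(y_hat_vec))

print(binary_loss, softmax_loss)   # both ~0.2231
```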
Hi Paul, thanks for your reply. I see the difference now: in the softmax case we already have a vector y of zeros and ones, which makes the one-term version of the cost function behave the same way the two-term version does for the single scalar output of the sigmoid.
Exactly. In the softmax case the y values are “one hot” vectors. In the binary (sigmoid) case, the loss formula takes the single 0 or 1 label and effectively expands it into a 2-element “one hot” vector.
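To make that concrete with the formulas from the slides: if the label is y = 1, the one-hot vector is [0, 1] and the prediction vector is [1-yhat, yhat], so [-SUM( y . log yhat )] reduces to -log(yhat), which is exactly what [-y . log yhat - (1-y) . log(1-yhat)] gives when y = 1 (and it reduces to -log(1-yhat) when y = 0).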