Clarification of the Derivative of the Log Loss Function

After the Forward and Backpropagation video in Week 4, I was wondering why the derivative of the loss function for Logistic Regression is:

da^{[L]} = - \frac{y}{a^{[L]}} + \frac{(1-y)}{(1-a^{[L]})}

and not, as I think was shown in the other videos, this:

da^{[L]} = a^{[L]}-y

I read through those extra articles in the notes and think I understand why, but I just wanted some clarification to make sure I fully understand.

Is it because the first equation above is the more generic derivative of the Log Loss function?

J = -(y \log{(a^{[L]})} + (1-y) \log{(1-a^{[L]})})

And the second equation is when we use the Sigmoid activation function:

a^{[L]} = \frac{1}{1+e^{-z^{[L]}}}

Does that mean that if we use a different activation function for the final layer in a classification problem that uses the Log Loss function as the cost, then we would need a different calculation for the derivative? (Or I guess we could just plug the new activation function into the generic equation if we cannot find or derive a simplified version as we have for the Sigmoid.)

Those are two different things:

The first formula you show is \displaystyle \frac {\partial L}{\partial A} which in Prof Ng’s notation he calls dA^{[L]}.

The second formula is \displaystyle \frac {\partial L}{\partial Z} which is dZ^{[L]}. It is the first formula times the derivative of sigmoid, so it’s one step of applying the Chain Rule.
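If it helps to see the Chain Rule step concretely, here is a small numerical sanity check (my own sketch, not from the course materials): it computes dA^{[L]}, multiplies by the sigmoid derivative a(1-a), and compares the result both to the simplified formula a - y and to a finite-difference estimate of dL/dz.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(z, y):
    # Log Loss (cross entropy) for a single example
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y = 0.7, 1.0
a = sigmoid(z)

dA = -y / a + (1 - y) / (1 - a)   # first formula: dL/dA
dZ_chain = dA * a * (1 - a)       # times sigmoid'(z) = a(1-a): one Chain Rule step
dZ_simplified = a - y             # second formula: the simplified dL/dZ

# Central finite-difference estimate of dL/dz as an independent check
eps = 1e-6
dZ_numeric = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)

print(dZ_chain, dZ_simplified, dZ_numeric)  # all three agree (≈ -0.3318)
```

All three values match, which is exactly the point: the second formula is not a different derivative, just the first one pushed through the sigmoid.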

Please see this thread for more information and the derivations.

You’re right that if you choose a different output layer activation or a different loss function, then you have to go back to the generic formulas and derive it all again. The reason it works out so nicely for us here is that the cross entropy loss function and sigmoid are a natural pair. The same is true once we get to multiclass classifications and use softmax plus cross entropy loss there.
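As a follow-on sketch (again my own, under the assumption of a one-hot label vector), you can verify numerically that the softmax plus cross entropy pairing gives the same clean result, dL/dz = a - y, in the multiclass case:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

z = np.array([0.2, -1.0, 0.5])
y = np.array([0.0, 0.0, 1.0])   # one-hot label

a = softmax(z)
dZ_simplified = a - y           # the claimed simplified gradient

# Finite-difference estimate of dL/dz, component by component
eps = 1e-6
dZ_numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    Lp = -np.sum(y * np.log(softmax(zp)))
    Lm = -np.sum(y * np.log(softmax(zm)))
    dZ_numeric[i] = (Lp - Lm) / (2 * eps)

print(np.allclose(dZ_numeric, dZ_simplified, atol=1e-5))  # True
```

So the "natural pair" pattern carries over: sigmoid with binary cross entropy and softmax with multiclass cross entropy both collapse the Chain Rule down to a - y.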


Great, thank you for the clarification. It makes sense, I forgot about the Chain Rule in this case.