Hello, I’m in the fourth week of the first course and I think I’m getting the hang of it, but I’m a little thrown by the different equations for calculating the initial value of dZ^L at the start of backward propagation.

In my notes I have dZ^L calculated as:

dZ^L = A^L - Y

or alternatively

dZ^L = dA^L * g^L’ (Z^L)

Are these both correct? If so, I’m having trouble understanding how they relate. Doesn’t this imply that:

A^L - Y = dA^L * g^L’ (Z^L) ?

I find it confusing that the difference between our predictions and the true labels would be equal to the derivative of the predictions multiplied elementwise by the derivative of our activation function evaluated at Z^L.

That all strikes me as very unintuitive. Can someone point me to a justification of this, or have I misunderstood something?

Well, it may not be intuitive, but you just have to work out the math, remembering that we’re dealing with the output layer here and the activation function is sigmoid.

Prof Ng shows in the lectures, and the notebook gives, that:

dA^{[L]} = -\left(\frac{Y}{A^{[L]}} - \frac{1 - Y}{1 - A^{[L]}}\right)

Now substitute that in your second formula and remember that because of the aforementioned sigmoid, we have:

g^{[L]'}(Z^{[L]}) = A^{[L]} (1 - A^{[L]})
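
For anyone who wants the algebra spelled out, here is a sketch of the substitution. It assumes the binary cross-entropy cost from the course, so that dA^{[L]} = -(Y / A^{[L]} - (1 - Y) / (1 - A^{[L]})), and a sigmoid output layer. All products and divisions are elementwise.

```latex
\begin{aligned}
dZ^{[L]} &= dA^{[L]} * g^{[L]'}(Z^{[L]}) \\
         &= -\left(\frac{Y}{A^{[L]}} - \frac{1 - Y}{1 - A^{[L]}}\right) * A^{[L]} (1 - A^{[L]}) \\
         &= -\left(Y (1 - A^{[L]}) - (1 - Y) A^{[L]}\right) \\
         &= -\left(Y - Y A^{[L]} - A^{[L]} + Y A^{[L]}\right) \\
         &= A^{[L]} - Y
\end{aligned}
```

So the two expressions agree only because of this particular pairing of cost and activation. With a different output activation or a different loss, the shortcut A^{[L]} - Y would not generally hold.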

So you can start from the fully general formula that we use in the hidden layers (as Phuc has shown) or you can use the special simplifications that you get because of the specifics of the output layer.
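
If it helps, the equivalence is easy to check numerically. The snippet below is a minimal sketch, not the notebook's code: the names `Z`, `Y`, `A`, `dZ_general` and `dZ_shortcut` are made up for illustration, the data is random, and it assumes a sigmoid output layer with the cross-entropy cost.

```python
import numpy as np


def sigmoid(z):
    """Elementwise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)

# Hypothetical output-layer pre-activations and binary labels for 5 examples.
Z = rng.normal(size=(1, 5))
Y = rng.integers(0, 2, size=(1, 5))

A = sigmoid(Z)  # predictions A^[L]

# General route: dA^[L] from the cross-entropy cost, times g'(Z^[L]).
dA = -(Y / A - (1 - Y) / (1 - A))
dZ_general = dA * A * (1 - A)  # sigmoid derivative is A * (1 - A)

# Shortcut that is specific to sigmoid + cross-entropy.
dZ_shortcut = A - Y

# Compare the two routes up to floating-point tolerance.
print(np.allclose(dZ_general, dZ_shortcut))
```

Both routes give the same array up to floating-point error, which is why the output layer can use either one.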

Actually there’s also a great thread that Eddy and Mubsi created quite a while ago that goes through a lot of these derivations. In case you haven’t seen it, it’s definitely worth a look.