Post Activation Gradient – Week 4

In week three, back-prop starts with dz[2] (the derivative of the cost with respect to z[2]): dz[2] = a[2] - y. Prof Ng says that to compute dz[2] one should first compute da[2] (the post-activation gradient) and then use da[2] to compute dz[2]; however, he says it is equivalent to simply compute dz[2] = a[2] - y directly. In week four, back-prop is initialised by computing da[L], and then da[L] is used to compute dz[L].

What I want to clarify is why we don’t just use dz[L] = a[L] - y in week four, as in week three. Why is it fine in week three to start directly from the loss shortcut, but then change in week four to using the “post-activation gradient”? There may be some equivalence between the two that I have missed…

Please can you clarify?

Hi, Matt.

Perhaps I am simply missing your point, but the formula for dZ^{[L]} is the same:

dZ^{[L]} = A^{[L]} - Y

All this is just a big application of the Chain Rule. At the output layer you have the extra steps of the Loss (vector function) followed by the Cost (scalar average of the loss values). You just have to keep track of what the “numerator” is in Prof Ng’s notation. In this case:

dZ^{[L]} = \displaystyle \frac {\partial L}{\partial Z^{[L]}}

The notation is slightly ambiguous since (e.g.):

dW^{[L]} = \displaystyle \frac {\partial J}{\partial W^{[L]}}

Note the J versus L there. Have you seen Eddy’s thread deriving all this?
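
In case it helps, here is a compressed sketch of that derivation for a sigmoid output unit with the cross entropy loss (just the standard Chain Rule argument, not a substitute for Eddy’s thread). With a^{[L]} = \sigma(z^{[L]}) and L(a^{[L]}, y) = -\big(y \log a^{[L]} + (1 - y) \log(1 - a^{[L]})\big), the two factors are:

\displaystyle \frac {\partial L}{\partial a^{[L]}} = -\frac{y}{a^{[L]}} + \frac{1 - y}{1 - a^{[L]}} \qquad \qquad \frac {\partial a^{[L]}}{\partial z^{[L]}} = a^{[L]} \left(1 - a^{[L]}\right)

Multiplying them together per the Chain Rule:

\displaystyle \frac {\partial L}{\partial z^{[L]}} = \left(-\frac{y}{a^{[L]}} + \frac{1 - y}{1 - a^{[L]}}\right) a^{[L]} \left(1 - a^{[L]}\right) = -y\left(1 - a^{[L]}\right) + (1 - y)\,a^{[L]} = a^{[L]} - y

So whether you initialize back-prop with dA^{[L]} and then multiply by the sigmoid derivative, or write dZ^{[L]} = A^{[L]} - Y directly, you land in the same place.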

Thanks for this! I read the thread and also stepped through the code outside of the Jupyter environment, which cleared up my understanding of the notation. I see what’s going on now: there are two different initializations for back-prop in the source code, and both are correct in terms of the chain rule.
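
For anyone who finds this thread later, here is roughly the kind of throwaway check I ran (my own script with made-up numbers, not the course code), comparing the two initializations for a sigmoid output layer with the cross-entropy loss:

```python
import numpy as np

np.random.seed(0)

# Made-up output-layer values: AL = sigmoid activations, Y = 0/1 labels
ZL = np.random.randn(1, 5)
AL = 1 / (1 + np.exp(-ZL))            # sigmoid(ZL), values strictly in (0, 1)
Y = np.random.randint(0, 2, (1, 5))

# Week 3 style: initialize back-prop with dZ directly
dZ_direct = AL - Y

# Week 4 style: initialize with dA (post-activation gradient), then dZ = dA * g'(Z)
dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))  # dL/dA for cross-entropy
dZ_via_dA = dAL * AL * (1 - AL)                       # sigmoid'(Z) = A * (1 - A)

print(np.allclose(dZ_direct, dZ_via_dA))  # True
```

Both paths give the same dZ array, which matches the A^{[L]} - Y shortcut above.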