For a single example, the derivative of the loss L with respect to z[2] is a[2] - y. Why does the derivative of the cost function J with respect to Z[2] have the same form as the single-example equation, A[2] - Y instead of a[2] - y? Where does the 1/m in J go?
Hello @Shawn_Shan! Interesting question…
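All of the gradients (derivatives) come from the chain rule. Are you familiar with it? The 1/m from J is applied once, in dW. Because dW is built from dZ and dA through the chain rule, that single factor already accounts for them, so it does not appear again in dZ or dA.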
Thank you for your response. I understand the chain rule. If dZ is taken as written in the lecture, all the derivatives that follow make sense. But dZ itself should be defined clearly. If dZ were dJ/dZ, it would need the 1/m, and then dW and db would not need that 1/m term. I think in the lecture dz is defined as dL/dz, so dZ should be defined as d(Sum of L(i))/dZ rather than dJ/dZ.
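The two candidate definitions can be written side by side. This is only a sketch, assuming the usual setup of a sigmoid output layer with cross-entropy loss, so that dz[2] = a[2] - y holds per example. The symbols follow the notation used in this thread.

```latex
\begin{align*}
J &= \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}\big(a^{[2](i)}, y^{(i)}\big) \\
dZ^{[2]} &:= \frac{\partial}{\partial Z^{[2]}} \sum_{i=1}^{m} \mathcal{L}\big(a^{[2](i)}, y^{(i)}\big) = A^{[2]} - Y \\
\frac{\partial J}{\partial Z^{[2]}} &= \frac{1}{m}\big(A^{[2]} - Y\big) = \frac{1}{m}\, dZ^{[2]} \\
dW^{[2]} &= \frac{\partial J}{\partial W^{[2]}} = \frac{1}{m}\, dZ^{[2]} A^{[1]\top} \\
db^{[2]} &= \frac{\partial J}{\partial b^{[2]}} = \frac{1}{m}\sum_{i=1}^{m} dZ^{[2](i)}
\end{align*}
```

Under this reading the 1/m is not lost. It is deferred to dW and db, which are true derivatives of J, while dZ carries the per-example derivatives of L.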
See my derivation below.
Hello @Shawn_Shan! I asked our top mentor @paulinpaloalto the same question some time ago. Here is his answer. Check it out and let me know what you think.
Best,
Saif.
This is very helpful. Basically, dA and dZ stand for dL/dA and dL/dZ, not dJ/dA and dJ/dZ. With that definition, all the remaining equations make sense.
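One way to convince yourself of this convention is a numerical gradient check. The sketch below is not from the lecture. It assumes a sigmoid output layer with cross-entropy loss, and the names `cost_from_Z`, `cost`, and `numeric_grad` are made up for illustration. It compares dZ = A2 - Y (no 1/m) and dW, db (with 1/m) against finite-difference derivatives of J.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_from_Z(Z2, Y):
    # J = (1/m) * sum of per-example cross-entropy losses
    A2 = sigmoid(Z2)
    m = Y.shape[1]
    losses = -(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
    return losses.sum() / m

def cost(W2, b2, A1, Y):
    return cost_from_Z(W2 @ A1 + b2, Y)

def numeric_grad(f, X, eps=1e-6):
    # Central-difference estimate of d f() / d X, perturbing X in place
    grad = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        old = X[idx]
        X[idx] = old + eps
        f_plus = f()
        X[idx] = old - eps
        f_minus = f()
        X[idx] = old  # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

# Small random problem (sizes are arbitrary, chosen for illustration)
rng = np.random.default_rng(0)
n1, m = 4, 5
A1 = rng.standard_normal((n1, m))
Y = rng.integers(0, 2, size=(1, m)).astype(float)
W2 = rng.standard_normal((1, n1)) * 0.1
b2 = np.zeros((1, 1))

# Backprop in the lecture's convention
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)
dZ2 = A2 - Y                              # per-example dL/dz, no 1/m
dW2 = (dZ2 @ A1.T) / m                    # 1/m enters here
db2 = dZ2.sum(axis=1, keepdims=True) / m  # and here

# Finite-difference derivatives of J itself
dJ_dZ2 = numeric_grad(lambda: cost_from_Z(Z2, Y), Z2)
dJ_dW2 = numeric_grad(lambda: cost(W2, b2, A1, Y), W2)
dJ_db2 = numeric_grad(lambda: cost(W2, b2, A1, Y), b2)

print("dJ/dZ2 matches dZ2 / m:", np.allclose(dJ_dZ2, dZ2 / m, atol=1e-6))
print("dJ/dW2 matches dW2:    ", np.allclose(dJ_dW2, dW2, atol=1e-6))
print("dJ/db2 matches db2:    ", np.allclose(dJ_db2, db2, atol=1e-6))
```

If the checks pass, dZ2 differs from the true dJ/dZ2 by exactly the factor 1/m, while dW2 and db2 are the true gradients of J.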
In fact, the original question you asked in that thread is exactly what I was trying to work out in one of my projects (linear regression with ReLU as the activation). That thread was helpful for my work. Thank you.
I am glad you like it.
Best,
Saif.