dZ^{[2]} should be normalised by 1/m. The lines that follow it should then drop their 1/m factor.
Correct formula:
Related post: Week 3,4: Why isn't 1/m part of dz^[L]?
Sorry, but this just says that you are interpreting the notation differently than Prof Ng is. He’s the boss, so he gets to define that:
dZ^{[2]} = \displaystyle \frac {\partial L}{\partial Z^{[2]}}
So it is a vector quantity that has not yet been averaged over the samples. That only happens when he computes the dW and db values. Those are the only ones that are w.r.t. J, as opposed to something else. All the other quantities are just Chain Rule factors that are used in computing the dW and db gradients.
Your comment doesn’t make sense.
If dZ=\frac{\partial L}{\partial Z}, then dW=\frac{\partial L}{\partial W}. The same L.
Where does it say that Prof Ng is required to be consistent in his notation? Those d values are just shorthands. It turns out that:
dW^{[l]} = \displaystyle \frac {\partial J}{\partial W^{[l]}}
You just have to understand the context to see why the formulas turn out the way that they do.
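To make that context concrete, here is a short sketch of where the average enters. It assumes the usual course setup, where the cost J is the mean of the per-sample losses L^{(i)} and column i of dZ^{[l]} holds the derivative of L^{(i)} with respect to z^{[l](i)}. The superscript (i) on L is my own shorthand for "the loss on sample i", not notation taken from the lectures.

```latex
\begin{aligned}
J &= \frac{1}{m} \sum_{i=1}^{m} L^{(i)} \\
dW^{[l]} = \frac{\partial J}{\partial W^{[l]}}
  &= \frac{1}{m} \sum_{i=1}^{m} \frac{\partial L^{(i)}}{\partial W^{[l]}}
   = \frac{1}{m} \, dZ^{[l]} A^{[l-1]T} \\
db^{[l]} = \frac{\partial J}{\partial b^{[l]}}
  &= \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)}
\end{aligned}
```

The matrix product dZ^{[l]} A^{[l-1]T} is what performs the sum over samples, so the \frac {1}{m} in front of it is the only place the average is applied.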
Keep in mind that the only dX values that are partial derivatives of J are the dW^{[l]} and db^{[l]} gradients. Every other value, including each dZ^{[l]} and dA^{[l]}, is a partial derivative of the per-sample loss L, not of J.
You have to think through how the Chain Rule applies when you compute the gradients of the W or b values at one of the inner layers of the network. The L to J transition is always there, but it’s literally the last step, right? You don’t want to end up with multiple factors of \frac {1}{m} …
Mind you, I’m not saying that your idea of including the factor of \frac {1}{m} only once in the dZ^{[L]} value for the final layer is mathematically wrong. Maybe that is really the simpler way to do it in some sense, but that is not the way Prof Ng has chosen to do it. My opinion in the matter is irrelevant: I’m just explaining how Prof Ng’s notation works.
But note that if you do it your way, you’re sort of “separating” the two pieces of the L to J transition: J is the average of L across the samples, so it’s not just the factor \frac {1}{m}, but also the sum, right? The derivative of the average is the average of the derivatives. In your formulation those two pieces are separated, but in Prof Ng’s they are not. Actually maybe that’s the best way to explain how Prof Ng has structured everything:
He does all the Chain Rule calculations in vector form, meaning everything is derivatives of L until the very last step of computing the actual dW^{[l]} and db^{[l]} values at an individual layer. That’s the point at which he averages the derivatives to get the derivative of the average cost J. That’s why the sum and the \frac {1}{m} only appear at that level.
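If it helps, the two bookkeeping conventions can be compared numerically. The following is a minimal NumPy sketch, not course code: it assumes a 2-layer network with a tanh hidden layer, a sigmoid output and cross-entropy loss, and all names, shapes and the random data are made up for illustration. It computes dW both ways, checks the result against a finite-difference gradient of J, and shows what goes wrong if the factor of \frac {1}{m} is applied in both places.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, m = 3, 4, 5  # hypothetical layer sizes and sample count

X = rng.normal(size=(n_x, m))
Y = rng.integers(0, 2, size=(1, m)).astype(float)
W1 = rng.normal(size=(n_h, n_x)) * 0.5
b1 = np.zeros((n_h, 1))
W2 = rng.normal(size=(1, n_h)) * 0.5
b2 = np.zeros((1, 1))


def forward(W1, b1, W2, b2):
    """tanh hidden layer followed by a sigmoid output layer."""
    A1 = np.tanh(W1 @ X + b1)
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))
    return A1, A2


def cost(W1, b1, W2, b2):
    """J: the mean over samples of the per-sample cross-entropy loss L."""
    _, A2 = forward(W1, b1, W2, b2)
    losses = -(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
    return losses.mean()


A1, A2 = forward(W1, b1, W2, b2)

# Convention A (the course's): dZ is a derivative of L, one column per sample.
# The sum (inside the matrix product) and the 1/m appear only at dW.
dZ2_a = A2 - Y
dW2_a = dZ2_a @ A1.T / m
dZ1_a = (W2.T @ dZ2_a) * (1 - A1**2)
dW1_a = dZ1_a @ X.T / m

# Convention B (the proposal): fold 1/m into dZ2 once, drop it everywhere else.
dZ2_b = (A2 - Y) / m
dW2_b = dZ2_b @ A1.T
dZ1_b = (W2.T @ dZ2_b) * (1 - A1**2)
dW1_b = dZ1_b @ X.T

# Mixing the two conventions applies 1/m twice.
dW1_mixed = dZ1_b @ X.T / m


def numerical_grad_W1(eps=1e-6):
    """Central-difference estimate of dJ/dW1, used as an independent check."""
    grad = np.zeros_like(W1)
    for idx in np.ndindex(*W1.shape):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        grad[idx] = (cost(Wp, b1, W2, b2) - cost(Wm, b1, W2, b2)) / (2 * eps)
    return grad


dW1_num = numerical_grad_W1()

same_result = np.allclose(dW1_a, dW1_b) and np.allclose(dW2_a, dW2_b)
matches_J = np.allclose(dW1_a, dW1_num, atol=1e-6)
mixed_is_off_by_m = np.allclose(dW1_mixed * m, dW1_a)

print("A and B give the same dW:", same_result)
print("dW1 matches the numerical gradient of J:", matches_J)
print("mixed version is too small by a factor of m:", mixed_is_off_by_m)
```

Both conventions produce the same dW and db, so the choice is purely about bookkeeping. The course's version keeps every intermediate dZ and dA as a per-sample derivative of L and averages only at the last step.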