Week 3: wrong formula for the derivatives dZ[2] in videos and notebook

Wenjie_Zheng · August 19, 2022, 8:53pm

dZ[2] should be normalised by 1/m. The lines that follow it should then drop their 1/m factors.

Correct formula:

Related post: Week 3,4: Why isn't 1/m part of dz^[L]?

paulinpaloalto · August 20, 2022, 12:04am

Sorry, but this just says that you are interpreting the notation differently than Prof Ng is. He’s the boss, so he gets to define that:

dZ^{[2]} = \displaystyle \frac {\partial L}{\partial Z^{[2]}}

So it is a vector quantity that has not yet been averaged over the samples. That only happens when he computes the dW and db values. Those are the only ones that are w.r.t. J, as opposed to something else. All the other quantities are just Chain Rule factors that are used in computing the dW and db gradients.
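Spelled out for the output layer, and assuming the usual Week 3 setup (sigmoid output, cross-entropy loss, m samples stacked as columns), the pieces fit together roughly like this:

J = \displaystyle \frac {1}{m} \sum_{i=1}^{m} L^{(i)}

dZ^{[2]} = \displaystyle \frac {\partial L}{\partial Z^{[2]}} = A^{[2]} - Y

dW^{[2]} = \displaystyle \frac {1}{m} dZ^{[2]} A^{[1]T}

db^{[2]} = \displaystyle \frac {1}{m} \sum_{i=1}^{m} dZ^{[2](i)}

Column i of dZ^{[2]} is the derivative of the loss L^{(i)} on sample i alone. The sum (hidden inside the matrix product for dW^{[2]}) and the \frac {1}{m} only show up in the last two lines.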

Wenjie_Zheng · August 20, 2022, 5:02am

Your comment doesn’t make sense.

If dZ=\frac{\partial L}{\partial Z}, then dW=\frac{\partial L}{\partial W}. The same L.

paulinpaloalto · August 20, 2022, 3:00pm

Where does it say that Prof Ng is required to be consistent in his notation? Those d values are just shorthands. It turns out that:

dW^{[l]} = \displaystyle \frac {\partial J}{\partial W^{[l]}}

You just have to understand the context to see why the formulas turn out the way that they do.

Keep in mind that literally the only dX values that are partial derivatives of J are the dW^{[l]} and db^{[l]} gradients. Literally every other value is a partial derivative of something different than J.

You have to think through how the Chain Rule applies when you compute the gradients of the W or b values at one of the inner layers of the network. The L to J transition is always there, but it’s literally the last step, right? You don’t want to end up with multiple factors of \frac {1}{m} …
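As a sketch of what that looks like for the hidden layer of the Week 3 network (same assumptions as the course formulas, with g^{[1]} the hidden activation and X the input matrix):

dZ^{[1]} = \displaystyle W^{[2]T} dZ^{[2]} * g^{[1]'}(Z^{[1]})

dW^{[1]} = \displaystyle \frac {1}{m} dZ^{[1]} X^{T}

db^{[1]} = \displaystyle \frac {1}{m} \sum_{i=1}^{m} dZ^{[1](i)}

If dZ^{[2]} already carried a factor of \frac {1}{m}, then dZ^{[1]} would inherit it through the first line, and keeping the \frac {1}{m} in the dW^{[1]} and db^{[1]} formulas would give \frac {1}{m^2}. So the factor has to live in exactly one place: either in the final layer's dZ, or in the dW and db formulas, but not both.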

paulinpaloalto · August 20, 2022, 3:44pm

Mind you, I’m not saying that your idea of including the factor of \frac {1}{m} only once in the dZ^{[L]} value for the final layer is mathematically wrong. Maybe that is really the simpler way to do it in some sense, but that is not the way Prof Ng has chosen to do it. My opinion in the matter is irrelevant: I’m just explaining how Prof Ng’s notation works.

But note that if you do it your way, you’re sort of “separating” the two pieces of the L to J transition: J is the average of L across the samples, so it’s not just the factor \frac {1}{m}, but also the sum, right? The derivative of the average is the average of the derivatives. In your formulation those two pieces are separated, but in Prof Ng’s they are not. Actually maybe that’s the best way to explain how Prof Ng has structured everything:

He does all the Chain Rule calculations in vector form, meaning everything is derivatives of L until the very last step of computing the actual dW^{[l]} and db^{[l]} values at an individual layer. That’s the point at which he averages the derivatives to get the derivative of the average cost J. That’s why the sum and the \frac {1}{m} only appear at that level.
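To make the comparison concrete, here is a minimal NumPy sketch of both conventions. It assumes the Week 3 architecture (one tanh hidden layer, sigmoid output, cross-entropy loss). The function names, layer sizes and random data are made up for illustration and are not taken from the notebook.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """Forward pass: tanh hidden layer, sigmoid output."""
    A1 = np.tanh(W1 @ X + b1)
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))
    return A1, A2

def grads_course(X, Y, W2, A1, A2):
    """Course convention: dZ is per-sample dL/dZ, 1/m appears in dW and db."""
    m = X.shape[1]
    dZ2 = A2 - Y                                  # no 1/m here
    dW2 = (dZ2 @ A1.T) / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)          # tanh'(z) = 1 - tanh(z)^2
    dW1 = (dZ1 @ X.T) / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

def grads_alternative(X, Y, W2, A1, A2):
    """Alternative convention: 1/m applied once, in the final layer's dZ."""
    m = X.shape[1]
    dZ2 = (A2 - Y) / m                            # the only 1/m
    dW2 = dZ2 @ A1.T
    db2 = dZ2.sum(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)
    dW1 = dZ1 @ X.T
    db1 = dZ1.sum(axis=1, keepdims=True)
    return dW1, db1, dW2, db2

# Small random problem (sizes are arbitrary).
rng = np.random.default_rng(0)
n_x, n_h, m = 3, 4, 5
X = rng.standard_normal((n_x, m))
Y = (rng.random((1, m)) > 0.5).astype(float)
W1 = rng.standard_normal((n_h, n_x)) * 0.1
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((1, n_h)) * 0.1
b2 = np.zeros((1, 1))

A1, A2 = forward(X, W1, b1, W2, b2)
g_course = grads_course(X, Y, W2, A1, A2)
g_alt = grads_alternative(X, Y, W2, A1, A2)

# The intermediate dZ values differ by a factor of m, but the gradients
# that gradient descent actually uses should agree up to rounding.
same = all(np.allclose(a, b) for a, b in zip(g_course, g_alt))
print("conventions agree:", same)
```

The two functions differ only in where the division by m happens. The dZ values mean different things in each (derivatives of L versus derivatives of J), but the dW and db results that feed the parameter update are the same.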
