Derivation of formula for dZ[2]

paulinpaloalto · May 19, 2023, 11:22pm

Prof Ng’s notation for the gradients is a little ambiguous. It turns out that only the final gradients that we actually apply, dW and db, are gradients of J. All the others, including dZ[2], are either gradients of L (the vector of per-sample losses) or simply Chain Rule factors used to compute dW and db.
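To make that concrete, here is how the layer 2 formulas read if you write out which function each one differentiates. This is a sketch that assumes the Week 3 setup (sigmoid output activation with the cross entropy loss); the column index j for the samples is my notation, not something from the lecture slides.

```latex
\begin{align*}
dZ^{[2]} &= \frac{\partial L}{\partial Z^{[2]}} = A^{[2]} - Y
  && \text{(column } j \text{ is the gradient of } L^{(j)}\text{; no } \tfrac{1}{m}) \\
dW^{[2]} &= \frac{\partial J}{\partial W^{[2]}} = \frac{1}{m}\, dZ^{[2]} A^{[1]T}
  && \text{(gradient of } J\text{)} \\
db^{[2]} &= \frac{\partial J}{\partial b^{[2]}} = \frac{1}{m} \sum_{j=1}^{m} dZ^{[2](j)}
  && \text{(gradient of } J\text{)}
\end{align*}
```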

Of course we know that by definition:

J = \displaystyle \frac {1}{m}\sum_{j = 1}^m L(y^{(j)},\hat{y}^{(j)})

Meaning that J is the average of the L values across the m samples in the batch. If you think about it for a second, you’ll see that the derivative of the average is the average of the derivatives, because differentiation is linear. So the factor of \frac {1}{m} only appears in the final gradients dW and db, which is where the average over the samples is taken.
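If you want to convince yourself numerically, here is a minimal NumPy sketch. It is not code from the course notebook: the helper names, the random data and the shapes are all made up for illustration, and it only models the output layer (sigmoid plus cross entropy) with A1 treated as a fixed input. It compares the analytic formulas against finite differences: dZ2 without the 1/m against the per-sample loss L, and dW2 with the 1/m against the cost J.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_sample_loss(Z2, Y):
    # Cross entropy loss L for each sample, shape (1, m).
    A2 = sigmoid(Z2)
    return -(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

def cost(W2, b2, A1, Y):
    # J is the average of the per-sample losses.
    return per_sample_loss(W2 @ A1 + b2, Y).mean()

# Hypothetical toy data: m samples, n1 hidden units feeding the output layer.
rng = np.random.default_rng(0)
m, n1 = 5, 3
A1 = rng.standard_normal((n1, m))
Y = rng.integers(0, 2, size=(1, m)).astype(float)
W2 = rng.standard_normal((1, n1)) * 0.1
b2 = np.zeros((1, 1))

# Analytic gradients.
Z2 = W2 @ A1 + b2
dZ2 = sigmoid(Z2) - Y                       # gradient of L, per sample: no 1/m
dW2 = (dZ2 @ A1.T) / m                      # gradient of J: 1/m appears here
db2 = dZ2.sum(axis=1, keepdims=True) / m    # gradient of J: 1/m appears here

# Finite-difference check of dZ2 against L (elementwise, since L^(j)
# depends only on column j of Z2).
eps = 1e-6
num_dZ2 = (per_sample_loss(Z2 + eps, Y) - per_sample_loss(Z2 - eps, Y)) / (2 * eps)

# Finite-difference check of dW2 against J.
num_dW2 = np.zeros_like(W2)
for k in range(n1):
    W_plus, W_minus = W2.copy(), W2.copy()
    W_plus[0, k] += eps
    W_minus[0, k] -= eps
    num_dW2[0, k] = (cost(W_plus, b2, A1, Y) - cost(W_minus, b2, A1, Y)) / (2 * eps)

dZ2_matches_L = np.allclose(dZ2, num_dZ2, atol=1e-6)
dW2_matches_J = np.allclose(dW2, num_dW2, atol=1e-6)
print("dZ2 matches dL/dZ2:", dZ2_matches_L)
print("dW2 matches dJ/dW2:", dW2_matches_J)
```

If you put a 1/m into dZ2 as well, the first check stops agreeing and dW2 ends up scaled by 1/m twice.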

Here’s another thread which discusses this in more detail.

Here’s another thread about this and here’s yet another.
