Derivation of formula for dZ[2]

Prof Ng’s notation for the gradients is a little ambiguous. Only the final gradients that we actually apply, which is to say dW and db, are gradients of J. All the others are either gradients of L (the per-sample loss, which is a vector with one value per sample) or simply Chain Rule factors used to compute dW and db.

Of course we know that by definition:

J = \displaystyle \frac {1}{m}\sum_{j = 1}^m L(y^{(j)},\hat{y}^{(j)})

Meaning that J is the average of the L values across the samples in the batch. If you think about it for a second, you’ll see that the derivative of the average is the average of the derivatives. So the factor of \frac {1}{m} only appears in the final gradients with respect to W and b, which is to say dW and db. Intermediate quantities like dZ[2] do not carry it.
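If it helps to see this numerically, here is a minimal NumPy sketch. It assumes a single sigmoid output layer with the cross-entropy loss, for which the per-sample gradient is dZ = A - Y. The names `A_prev`, `cost`, and the tiny random data are made up for illustration and are not from the course notebooks. It computes dZ with no 1/m factor, applies the 1/m only when forming dW and db, and compares the result against a finite-difference estimate of the gradient of J.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W, b, A_prev, Y):
    """J: the average of the per-sample cross-entropy losses L."""
    A = sigmoid(W @ A_prev + b)
    L = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))  # shape (1, m), one loss per sample
    return L.mean()

# Hypothetical toy data: 4 input features, m = 5 samples.
rng = np.random.default_rng(0)
n_prev, m = 4, 5
A_prev = rng.standard_normal((n_prev, m))
Y = rng.integers(0, 2, size=(1, m)).astype(float)
W = rng.standard_normal((1, n_prev)) * 0.1
b = np.zeros((1, 1))

# Analytic gradients.
A = sigmoid(W @ A_prev + b)
dZ = A - Y                               # gradient of L per sample: no 1/m here
dW = (dZ @ A_prev.T) / m                 # gradient of J: the 1/m appears here
db = dZ.sum(axis=1, keepdims=True) / m   # gradient of J: and here

# Finite-difference estimate of the gradient of J, for comparison.
eps = 1e-6
dW_num = np.zeros_like(W)
for i in range(W.shape[1]):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[0, i] += eps
    W_minus[0, i] -= eps
    dW_num[0, i] = (cost(W_plus, b, A_prev, Y) - cost(W_minus, b, A_prev, Y)) / (2 * eps)
db_num = (cost(W, b + eps, A_prev, Y) - cost(W, b - eps, A_prev, Y)) / (2 * eps)

print(np.allclose(dW, dW_num, atol=1e-6), np.allclose(db, db_num, atol=1e-6))
```

If you put the 1/m into dZ as well, dW and db would end up scaled by 1/m twice and the comparison would fail.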

Here’s another thread which discusses this in more detail.

Here’s another thread about this and here’s yet another.
