Dividing by "m" in back propagation using vectorized implementation

Right! The key point is that Prof Ng’s notation for the gradients is a bit ambiguous. You need to remember what the numerator of each partial derivative is. E.g.:

dW^{[1]} = \displaystyle \frac {\partial J}{\partial W^{[1]}}

Since J is the mean of L, the vector of losses across the m samples, that gradient of course includes the factor of \frac {1}{m}.
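Just to make explicit where that factor comes from (this is nothing beyond the definition of the cost as the average of the per-sample losses):

J = \displaystyle \frac {1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) \quad \Rightarrow \quad \frac {\partial J}{\partial W^{[1]}} = \frac {1}{m} \sum_{i=1}^{m} \frac {\partial \mathcal{L}(\hat{y}^{(i)}, y^{(i)})}{\partial W^{[1]}}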

But all the gradients other than the dW and db values are not partial derivatives w.r.t. J; they are just Chain Rule factors that we need to compute in order to get the final dW and db gradients, which are the ones we really care about, because they are what actually get used to update the parameters.

For example, in the case you mentioned, dZ^{[1]} is:

dZ^{[1]} = \displaystyle \frac {\partial L}{\partial Z^{[1]}}

As I mentioned just above, L is a vector quantity with one element for each of the m samples. We haven’t yet taken the average when computing dZ^{[1]}, so there is no factor of \frac {1}{m}.
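You can see this pattern in the vectorized code as well. Here is a minimal numpy sketch, assuming the two-layer network from the Week 3 exercise (tanh hidden activation, sigmoid output); the function and variable names are just illustrative, not the exact ones from the assignment:

```python
import numpy as np

def backward_propagation(X, Y, cache, parameters):
    """Vectorized backprop for a 2-layer net (tanh hidden, sigmoid output).

    Note where 1/m appears: only in dW and db, which are gradients of the
    cost J (an average over the m samples). dZ2 and dZ1 are Chain Rule
    factors w.r.t. the per-sample loss L, so they carry no 1/m.
    """
    m = X.shape[1]                       # number of samples (columns of X)
    A1, A2 = cache["A1"], cache["A2"]
    W2 = parameters["W2"]

    dZ2 = A2 - Y                         # no 1/m here
    dW2 = (1.0 / m) * np.dot(dZ2, A1.T)  # 1/m appears: gradient of J
    db2 = (1.0 / m) * np.sum(dZ2, axis=1, keepdims=True)

    dZ1 = np.dot(W2.T, dZ2) * (1.0 - A1 ** 2)  # tanh'(Z1) = 1 - A1^2; still no 1/m
    dW1 = (1.0 / m) * np.dot(dZ1, X.T)
    db1 = (1.0 / m) * np.sum(dZ1, axis=1, keepdims=True)

    return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}
```

The averaging over the m columns happens only at the point where we form dW and db, which is exactly where the \frac {1}{m} shows up.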

This topic has come up quite a few times before. Here’s an earlier thread about it.

The other high level point here is that this course is specifically designed not to require calculus as a prerequisite. That’s the good news, but there is accompanying bad news: that means you just have to accept the formulas as Prof Ng gives them to us. Showing the derivations requires that you know multivariate and vector calculus. Here’s a thread with links to more material on this if you have the math background and really want to understand how all the formulas are derived.
