I’m watching the Week 3 “Backpropagation Intuition (Optional)” video and at around 13:10, I don’t fully understand his explanation of why he divides by “m” (the number of training examples). I included a screenshot below of the 6 equations I’m wondering about. I get that there is a factor of 1/m in the cost function because it’s essentially taking the average of the losses over the “m” training examples. But I don’t get why dW2 and dW1 have this factor of 1/m while dZ1 and dZ2 don’t.
Can anyone help me understand this? Thanks in advance!
Attached is a screenshot from the “Vectorizing Across Multiple Examples” video, where you can see that Z is an array with the output for each sample stacked together as columns. Since we are calculating the average of dW for use in the parameter update, we need to divide by m.
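For example (writing the layer 1 gradient from memory, so the notation may differ slightly from the slide):

dW^{[1]} = \frac {1}{m} dZ^{[1]} X^T

Here dZ^{[1]} has one column per training example, and the product with X^T sums over those m columns, so multiplying by \frac {1}{m} turns that sum into the average we want.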
Right! The key point is that Prof Ng’s notation for the gradients is a bit ambiguous. You need to remember what the “numerator” is on the partial derivative. E.g.:
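dW^{[2]} = \frac {\partial J}{\partial W^{[2]}}, \qquad J = \frac {1}{m} \sum_{i=1}^{m} L(a^{[2](i)}, y^{(i)})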
Since J is the mean of L, the vector of losses across the m samples, that gradient will of course include the factor of \frac {1}{m}.
But all the gradients other than the dW and db values are not partial derivatives of J, but of something else. They are just Chain Rule factors that we need to compute in order to get the final dW and db gradients, which are the ones we really care about, because they are the ones actually used to update the parameters.
For example, in the case you mentioned, the gradient dZ^{[1]} is:
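dZ^{[1]} = \frac {\partial L}{\partial Z^{[1]}}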
As I mentioned just above, L is a vector quantity with one element for each of the m samples. We haven’t yet taken the average when computing dZ^{[1]}, so there is no factor of \frac {1}{m}.
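If it helps to see where the \frac {1}{m} actually enters, here is a minimal NumPy sketch of one backprop pass for a 2-layer network (my own toy shapes and variable names, not the ones from the course notebook): the dZ terms keep one column per training example, and the \frac {1}{m} only appears when we collapse that example dimension to get dW and db.

```python
import numpy as np

# Minimal sketch of one backprop pass for a 2-layer network
# (toy shapes and my own variable names; tanh hidden layer, sigmoid output).
np.random.seed(0)
n_x, n_1, m = 3, 4, 5                     # input size, hidden size, # examples
X  = np.random.randn(n_x, m)              # one column per training example
Y  = (np.random.rand(1, m) > 0.5) * 1.0   # labels, shape (1, m)
W1 = np.random.randn(n_1, n_x) * 0.01
b1 = np.zeros((n_1, 1))
W2 = np.random.randn(1, n_1) * 0.01
b2 = np.zeros((1, 1))

# Forward pass: Z and A keep one column per example.
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)
Z2 = W2 @ A1 + b2
A2 = 1 / (1 + np.exp(-Z2))                # sigmoid output

# Backward pass: the dZ terms are still per-example (shape (*, m)), no 1/m.
dZ2 = A2 - Y                              # dL/dZ2, one column per sample
dW2 = (1 / m) * dZ2 @ A1.T                # the matrix product sums over the
db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)  # m columns, so divide by m

dZ1 = W2.T @ dZ2 * (1 - A1 ** 2)          # tanh'(Z1) = 1 - A1^2, still per sample
dW1 = (1 / m) * dZ1 @ X.T                 # 1/m appears only where we collapse
db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)  # the sample dimension

print(dZ1.shape, dW1.shape)               # (4, 5) vs (4, 3): dZ1 keeps m, dW1 does not
```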
The other high level point here is that this course is specifically designed not to require calculus as a prerequisite. That’s the good news, but there is accompanying bad news: that means you just have to accept the formulas as Prof Ng gives them to us. Showing the derivations requires that you know multivariate and vector calculus. Here’s a thread with links to more material on this if you have the math background and really want to understand how all the formulas are derived.
Thank you @Kic and @paulinpaloalto for the responses! This makes much more sense now. Thinking about what the “numerator” is on the partial derivative was especially helpful. Also I’ll be sure to check earlier threads more carefully next time so I don’t ask something that’s been answered before!