I’m watching the Week 3 “Backpropagation Intuition (Optional)” video and at around 13:10, I don’t fully understand his explanation of why he divides by “m” (the number of training examples). I included a screenshot below of the 6 equations I’m wondering about. I get that there is a factor of 1/m in the cost function because it’s essentially taking the average of the losses over the “m” training examples. But I don’t get why dW2 and dW1 have this factor of 1/m while dZ1 and dZ2 don’t.
Can anyone help me understand this? Thanks in advance!
Attached is a screenshot from the “Vectorizing Across Multiple Examples” video, where you can see that Z is an array with the output for each sample stacked together as columns. Since we are calculating the average of dW for use in the parameter update, we need to divide by m.
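For example (writing the layer 1 gradient from memory, so the notation may differ slightly from the slide):

dW^{[1]} = \frac {1}{m} dZ^{[1]} X^T

Here dZ^{[1]} has one column per training example, and the product with X^T sums over those m columns, so multiplying by \frac {1}{m} turns that sum into the average we want.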
Right! The key point is that Prof Ng’s notation for the gradients is a bit ambiguous. You need to remember what the “numerator” is on the partial derivative. E.g.:
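dW^{[2]} = \frac {\partial J}{\partial W^{[2]}}, \qquad J = \frac {1}{m} \sum_{i=1}^{m} L(a^{[2](i)}, y^{(i)})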
Since J is the mean of L, the vector of losses across the m samples, that gradient will of course include the factor of \frac {1}{m}.
But all the gradients other than the dW and db values are not partial derivatives of J, but of something else. They are just Chain Rule factors that we need to compute in order to get the final dW and db gradients, which are the ones we really care about, because they are the ones actually used to update the parameters.
For example, in the case you mentioned, the gradient dZ^{[1]} is:
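dZ^{[1]} = \frac {\partial L}{\partial Z^{[1]}}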
As I mentioned just above, L is a vector quantity with one element for each of the m samples. We haven’t yet taken the average when computing dZ^{[1]}, so there is no factor of \frac {1}{m}.
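If it helps to see where the \frac {1}{m} actually enters, here is a minimal NumPy sketch of one backprop pass for a 2-layer network (my own toy shapes and variable names, not the ones from the course notebook): the dZ terms keep one column per training example, and the \frac {1}{m} only appears when we collapse that example dimension to get dW and db.

```python
import numpy as np

# Minimal sketch of one backprop pass for a 2-layer network
# (toy shapes and my own variable names; tanh hidden layer, sigmoid output).
np.random.seed(0)
n_x, n_1, m = 3, 4, 5                     # input size, hidden size, # examples
X  = np.random.randn(n_x, m)              # one column per training example
Y  = (np.random.rand(1, m) > 0.5) * 1.0   # labels, shape (1, m)
W1 = np.random.randn(n_1, n_x) * 0.01
b1 = np.zeros((n_1, 1))
W2 = np.random.randn(1, n_1) * 0.01
b2 = np.zeros((1, 1))

# Forward pass: Z and A keep one column per example.
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)
Z2 = W2 @ A1 + b2
A2 = 1 / (1 + np.exp(-Z2))                # sigmoid output

# Backward pass: the dZ terms are still per-example (shape (*, m)), no 1/m.
dZ2 = A2 - Y                              # dL/dZ2, one column per sample
dW2 = (1 / m) * dZ2 @ A1.T                # the matrix product sums over the
db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)  # m columns, so divide by m

dZ1 = W2.T @ dZ2 * (1 - A1 ** 2)          # tanh'(Z1) = 1 - A1^2, still per sample
dW1 = (1 / m) * dZ1 @ X.T                 # 1/m appears only where we collapse
db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)  # the sample dimension

print(dZ1.shape, dW1.shape)               # (4, 5) vs (4, 3): dZ1 keeps m, dW1 does not
```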
The other high level point here is that this course is specifically designed not to require calculus as a prerequisite. That’s the good news, but there is accompanying bad news: that means you just have to accept the formulas as Prof Ng gives them to us. Showing the derivations requires that you know multivariate and vector calculus. Here’s a thread with links to more material on this if you have the math background and really want to understand how all the formulas are derived.
Thank you @Kic and @paulinpaloalto for the responses! This makes much more sense now. Thinking about what the “numerator” is on the partial derivative was especially helpful. Also I’ll be sure to check earlier threads more carefully next time so I don’t ask something that’s been answered before!