Week 3 - Backpropagation Intuition - gradient descent

Hi everyone, I have a question about this slide:


Why does dW[2] have 1/m in its formula?

I understand that dW[2] = (1/m) dZ[2] A[1].T, and Prof Ng said there is an “extra 1 over m because the cost function J is this 1 over m of the sum from i equals 1 through m of the losses.”

If it has 1/m because we take the derivative of the cost function J(…), then why doesn’t dZ[2] have 1/m?

Thanks all.

The 1/m occurs because the cost J is the average of the losses across the m samples. So any gradient that is the derivative of J will have that factor because the derivative of the average is the average of the derivatives (think about it for a sec and that will make sense).
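To make that concrete, here is the averaging written out, using the cost J and per-example loss notation from the lectures (the loss symbol and indexing are just the usual course conventions):

```latex
% The cost is the average of the per-example losses:
J = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\big(a^{[2](i)}, y^{(i)}\big)

% Differentiating w.r.t. W^{[2]} pulls the 1/m through the sum,
% so the gradient of the average is the average of the per-example gradients:
\frac{\partial J}{\partial W^{[2]}}
  = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial \mathcal{L}^{(i)}}{\partial W^{[2]}}
```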

So you have to keep track of what each of those quantities is the derivative of. The various dZ values are just “Chain Rule” factors that go into computing the final gradients of W and b, so they are not averages. Remember that Z^{[2]} is a row vector with m columns, right? So dZ^{[2]} will also have m columns. You can see that the average doesn’t come into the picture until you use dZ to compute dW and db for the corresponding layer.
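If it helps to see the shapes, here is a minimal numpy sketch of the layer-2 backprop step for a one-hidden-layer binary classifier (tanh hidden layer, sigmoid output with cross-entropy loss, so dZ2 = A2 - Y); the layer sizes and variable names are just illustrative, not from the slide:

```python
import numpy as np

np.random.seed(0)
m = 5                       # number of training examples
n_x, n_h, n_y = 3, 4, 1     # layer sizes (illustrative)

# Forward pass (only the shapes matter for this point)
X  = np.random.randn(n_x, m)
Y  = np.random.randint(0, 2, (n_y, m))
W1 = np.random.randn(n_h, n_x); b1 = np.zeros((n_h, 1))
W2 = np.random.randn(n_y, n_h); b2 = np.zeros((n_y, 1))

Z1 = W1 @ X + b1
A1 = np.tanh(Z1)
Z2 = W2 @ A1 + b2
A2 = 1 / (1 + np.exp(-Z2))          # sigmoid output, shape (1, m)

# Backward pass for layer 2
dZ2 = A2 - Y                         # per-example "chain rule" factor, shape (1, m) -- no 1/m
dW2 = (1 / m) * dZ2 @ A1.T           # the matrix product sums over the m columns,
                                     # so the 1/m turns that sum into an average over examples
db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)

print(dZ2.shape)   # (1, 5): one column per training example
print(dW2.shape)   # (1, 4): same shape as W2, already averaged over the examples
```

So dZ2 still carries one column per example, and the averaging only happens at the moment you collapse those m columns into dW2 and db2.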