Week 3 - Backpropagation Intuition - gradient descent

Hi everyone, I have a question about this slide:


Why does dW[2] have 1/m in its formula?

I understand that dW[2] = (1/m) dZ[2] A[1].T, and Prof Ng said there is an “extra 1 over m because the cost function J is this 1 over m of the sum from i equals 1 through m of the losses.”

If it has 1/m because we take the derivative of the cost function J(…), then why doesn’t dZ[2] have 1/m?

Thanks all.

The 1/m occurs because the cost J is the average of the losses across the m samples. So any gradient that is the derivative of J will have that factor because the derivative of the average is the average of the derivatives (think about it for a sec and that will make sense).
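To make that concrete, here is the averaging written out, using the cost J and per-example loss notation from the lectures (the loss symbol and indexing are just the usual course conventions):

```latex
% The cost is the average of the per-example losses:
J = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\big(a^{[2](i)}, y^{(i)}\big)

% Differentiating w.r.t. W^{[2]} pulls the 1/m through the sum,
% so the gradient of the average is the average of the per-example gradients:
\frac{\partial J}{\partial W^{[2]}}
  = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial \mathcal{L}^{(i)}}{\partial W^{[2]}}
```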

So you have to keep track of what each of those quantities is the derivative of. The various dZ values are just “Chain Rule” factors that go into computing the final gradients of W and b, so they are not averages. Remember that Z^{[2]} is a row vector with m columns, right? So dZ^{[2]} will also have m columns. You can see that the average doesn’t come into the picture until you use dZ to compute dW and db for the corresponding layer.
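If it helps to see the shapes, here is a minimal numpy sketch of the layer-2 backprop step for a one-hidden-layer binary classifier (tanh hidden layer, sigmoid output with cross-entropy loss, so dZ2 = A2 - Y); the layer sizes and variable names are just illustrative, not from the slide:

```python
import numpy as np

np.random.seed(0)
m = 5                       # number of training examples
n_x, n_h, n_y = 3, 4, 1     # layer sizes (illustrative)

# Forward pass (only the shapes matter for this point)
X  = np.random.randn(n_x, m)
Y  = np.random.randint(0, 2, (n_y, m))
W1 = np.random.randn(n_h, n_x); b1 = np.zeros((n_h, 1))
W2 = np.random.randn(n_y, n_h); b2 = np.zeros((n_y, 1))

Z1 = W1 @ X + b1
A1 = np.tanh(Z1)
Z2 = W2 @ A1 + b2
A2 = 1 / (1 + np.exp(-Z2))          # sigmoid output, shape (1, m)

# Backward pass for layer 2
dZ2 = A2 - Y                         # per-example "chain rule" factor, shape (1, m) -- no 1/m
dW2 = (1 / m) * dZ2 @ A1.T           # the matrix product sums over the m columns,
                                     # so the 1/m turns that sum into an average over examples
db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)

print(dZ2.shape)   # (1, 5): one column per training example
print(dW2.shape)   # (1, 4): same shape as W2, already averaged over the examples
```

So dZ2 still carries one column per example, and the averaging only happens at the moment you collapse those m columns into dW2 and db2.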