In gradient descent over multiple samples, I think all the formulas should have 1/m, not only dW and db.
Are there any problems here?
It’s a good question, but that is not a mistake. The point is that you have to be careful to keep track of which of the gradient values are derivatives of L, the vector loss, and which are derivatives of J, the scalar cost which is the average of L over the samples. In the way that Professor Ng formulates this, the only gradients that are derivatives of J are the dW and db gradients. All the others are of L. So it is only the dW and db values that have the factor of 1/m.
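To make the bookkeeping concrete, here is a sketch of the relevant formulas for the output layer of a 2-layer network. It assumes a sigmoid output with the cross-entropy loss, which is what gives the simple form of dZ^{[2]}; the superscript (i) for the sample index is my own notation here.

$$J = \frac{1}{m} \sum_{i=1}^{m} L^{(i)}$$

$$dZ^{[2]} = A^{[2]} - Y \qquad \text{(column } i \text{ is } \partial L^{(i)} / \partial z^{[2](i)}\text{, so no } 1/m\text{)}$$

$$dW^{[2]} = \frac{\partial J}{\partial W^{[2]}} = \frac{1}{m} \, dZ^{[2]} A^{[1]T} \qquad db^{[2]} = \frac{\partial J}{\partial b^{[2]}} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[2](i)}$$

The matrix product dZ^{[2]} A^{[1]T} sums the per-sample contributions, and the single factor of 1/m turns that sum into the average that defines J.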
This question has come up a number of times before. Here’s a thread that links to multiple earlier discussions on this point.
Hello @wanghai673
Note that if you give dZ^{[2]} a 1/m, then dW^{[2]} will end up having two 1/m which is a problem.
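Here is a small numerical check of that point. It is only a sketch under my own assumptions (a tiny 2-layer network with tanh hidden units, a sigmoid output, cross-entropy loss, and random data); the function names and the `scale_dZ2` flag are made up for this illustration and are not from the course notebooks. It compares the backprop dW^{[2]} against a finite-difference gradient of J, with and without the extra 1/m on dZ^{[2]}.

```python
import numpy as np

def forward(W1, b1, W2, b2, X):
    """Forward pass: tanh hidden layer, sigmoid output."""
    A1 = np.tanh(W1 @ X + b1)
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))
    return A1, A2

def cost(W1, b1, W2, b2, X, Y):
    """J = average over the m samples of the per-sample loss L."""
    _, A2 = forward(W1, b1, W2, b2, X)
    L = -(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))  # shape (1, m)
    return L.mean()

def backward_dW2(W1, b1, W2, b2, X, Y, scale_dZ2=False):
    """dW2 per the course formulas; scale_dZ2=True adds the extra 1/m."""
    m = X.shape[1]
    A1, A2 = forward(W1, b1, W2, b2, X)
    dZ2 = A2 - Y                 # per-sample gradients of L
    if scale_dZ2:
        dZ2 = dZ2 / m            # the proposed (incorrect) extra factor
    return (dZ2 @ A1.T) / m      # the 1/m that belongs to J

def numerical_dW2(W1, b1, W2, b2, X, Y, eps=1e-6):
    """Central-difference estimate of dJ/dW2."""
    grad = np.zeros_like(W2)
    for idx in np.ndindex(*W2.shape):
        Wp, Wm = W2.copy(), W2.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        grad[idx] = (cost(W1, b1, Wp, b2, X, Y)
                     - cost(W1, b1, Wm, b2, X, Y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
n_x, n_h, m = 3, 4, 5
X = rng.normal(size=(n_x, m))
Y = rng.integers(0, 2, size=(1, m)).astype(float)
W1 = rng.normal(size=(n_h, n_x)) * 0.5
b1 = np.zeros((n_h, 1))
W2 = rng.normal(size=(1, n_h)) * 0.5
b2 = np.zeros((1, 1))

dW2_num = numerical_dW2(W1, b1, W2, b2, X, Y)
dW2_ok = backward_dW2(W1, b1, W2, b2, X, Y)
dW2_bad = backward_dW2(W1, b1, W2, b2, X, Y, scale_dZ2=True)

print("course formula matches dJ/dW2:", np.allclose(dW2_ok, dW2_num, atol=1e-6))
print("extra 1/m is too small by a factor of m:", np.allclose(dW2_bad * m, dW2_num, atol=1e-6))
```

With the extra factor, dW^{[2]} comes out m times too small, so gradient descent would behave as if the learning rate had been divided by m.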
I would like to just extend a bit from Paul’s excellent answer: Paul emphasizes the difference between J and L because, by our definition, J is the cost over all samples whereas L is the loss of a single sample. Therefore, when a gradient is of L, such as dZ^{[2]}, we don’t need 1/m because each element in the matrix (or array) of dZ^{[2]} is only about one sample.
Cheers,
Raymond
OK, I see, thanks.
I think it’s a very good design because it can ensure the consistency of the formula structure for updating all parameters.
Another approach is to divide only dz[last] by m and leave 1/m out of the rest of the formulas, but it’s not very consistent.
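For what it's worth, the two conventions should give the same parameter gradients. Below is a rough check, again assuming a 2-layer network with tanh hidden units, a sigmoid output, and random data; the `backprop` helper and its `scale_last_only` flag are hypothetical names used only for this sketch.

```python
import numpy as np

def backprop(A1, A2, X, Y, W2, scale_last_only):
    """Return (dW1, db1, dW2, db2) under one of two conventions.

    scale_last_only=False: dZ holds gradients of L; 1/m goes on every dW, db.
    scale_last_only=True:  dZ2 is divided by m once; no 1/m anywhere else.
    """
    m = X.shape[1]
    if scale_last_only:
        dZ2, k = (A2 - Y) / m, 1.0
    else:
        dZ2, k = A2 - Y, 1.0 / m
    dW2 = k * (dZ2 @ A1.T)
    db2 = k * dZ2.sum(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)   # derivative of tanh is 1 - A1^2
    dW1 = k * (dZ1 @ X.T)
    db1 = k * dZ1.sum(axis=1, keepdims=True)
    return dW1, db1, dW2, db2

rng = np.random.default_rng(1)
n_x, n_h, m = 3, 4, 6
X = rng.normal(size=(n_x, m))
Y = rng.integers(0, 2, size=(1, m)).astype(float)
W1 = rng.normal(size=(n_h, n_x)) * 0.5
b1 = np.zeros((n_h, 1))
W2 = rng.normal(size=(1, n_h)) * 0.5
b2 = np.zeros((1, 1))

# Forward pass shared by both conventions
A1 = np.tanh(W1 @ X + b1)
A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))

course_grads = backprop(A1, A2, X, Y, W2, scale_last_only=False)
alt_grads = backprop(A1, A2, X, Y, W2, scale_last_only=True)

same = all(np.allclose(a, b) for a, b in zip(course_grads, alt_grads))
print("both conventions give the same dW, db:", same)
```

The difference is only in what the intermediate dZ values mean: in the course's convention they are gradients of the per-sample loss L, while in the alternative they are already gradients of J.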