Since ‘m’ is a constant, it doesn’t matter whether it’s included or not. It’s just a scaling factor. It has no impact on which weight values will give the minimum cost.
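To spell that out as an equation (using $J$ here for the summed loss over the $m$ training examples, so the averaged cost is $\frac{1}{m}J$): for any constant $m > 0$,

$$\arg\min_{W,b}\ \frac{1}{m}\,J(W,b) \;=\; \arg\min_{W,b}\ J(W,b),$$

so including or omitting the $1/m$ only rescales the gradients; it does not move the minimizer.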
Thank you very much for the answer. But I still cannot clearly understand this scaling change, mainly because dA_prev’s scaling still looks consistent with the Course 1 materials. I would really appreciate it if you could provide clarification on the additional questions below.
I think my question is conditioned on the assumption that, between Course 1 and Course 4, the scaling has changed only for dW and db, not for dA_prev. Maybe my first question is: “is that the case?”
If so, wouldn’t changing the scale only for dW and db make a difference in the path and speed of finding the optimal weights, especially when m is very large? Is scaling not a concern because we usually use a more advanced optimization method like Adam? Or is it still not a concern even if we use basic batch gradient descent with a single fixed learning rate?
dA_prev is a per-sample variable, so it is not scaled by m. It is true that the C1W3A1 assignment requires dW and db to be divided by m; however, in C4W1A1 there is no such requirement, which can be verified from the exercise’s description.
Let’s focus on C1W3A1 first. If we check all of the formulas, we find that the “origin” or “source” of the 1/m is actually the cost function. We are just passing this 1/m down through back-prop, and it finally ends up in dW and db.
The same idea applies to C4W1A1, even though that assignment does not show (or use) any cost function, because this time it is not a complete neural network (it is a complete NN in C1W3A1).
Therefore, when we implement a complete convolutional NN, there should be a 1/m for dW and db due to the cost function. I hope this point is clear now.
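As a minimal NumPy sketch of this point (variable names such as dZ and A_prev and the layer shapes are just illustrative, following the course convention), the backward step of one fully-connected layer looks like:

```python
import numpy as np

def linear_backward(dZ, A_prev, W):
    """Illustrative backward step for one linear layer (C1W3A1 style).

    dZ:      gradient of the cost w.r.t. Z, shape (n_out, m)
    A_prev:  activations of the previous layer, shape (n_in, m)
    W:       weight matrix, shape (n_out, n_in)
    """
    m = A_prev.shape[1]

    # dW and db average over the m examples; this 1/m is the one that
    # originates from the 1/m in the cost function.
    dW = (1.0 / m) * dZ @ A_prev.T
    db = (1.0 / m) * np.sum(dZ, axis=1, keepdims=True)

    # dA_prev stays a per-example quantity (one column per example),
    # so there is no 1/m here.
    dA_prev = W.T @ dZ

    return dA_prev, dW, db
```

The same pattern carries over to a convolutional layer: the 1/m appears only where gradients are accumulated into the shared parameters dW and db, never in dA_prev.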
As Tom has explained, the 1/m can be seen as a scaling factor. If dW and db are not divided by m, then we need to adjust another scaling factor, the learning rate, to compensate for that. Therefore, it is a concern, and it is the same concern whether we use Adam or not.
However, as I said above, in a complete CNN the 1/m is passed down to each dW and db, so the learning rate does not need to compensate for this.
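Here is a tiny runnable check of that equivalence with plain batch gradient descent (a minimal sketch on a made-up least-squares problem; names like grad_sum are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000
X = rng.normal(size=(m, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=m)

def grad_sum(W):
    # Gradient of the *summed* squared error over all m examples (no 1/m).
    return 2 * X.T @ (X @ W - y)

lr = 0.01
W_a = np.zeros(3)   # averaged gradient, learning rate lr
W_b = np.zeros(3)   # summed gradient, learning rate lr / m
for _ in range(200):
    W_a -= lr * (grad_sum(W_a) / m)   # divide the gradient by m
    W_b -= (lr / m) * grad_sum(W_b)   # or fold the 1/m into the learning rate
print(np.allclose(W_a, W_b))          # True: same sequence of weights
```

Both update rules trace exactly the same path, which is why omitting the 1/m only forces a compensating change in the learning rate rather than changing which weights are optimal.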
I really appreciate your detailed explanation, Raymond! Now I better understand what’s going on.
And I think I now understand Tom’s comment. Because the only parameters here are W and b, applying the same scaling factor to dW and db does not change the solution, as long as the learning rate compensates for it. That makes sense. Thank you, Tom!