@marieak, @paulinpaloalto is better at explaining this, but a really easy (at least in my mind) way to think about it: if you see 1/m or 1/n (or something similar) in front of a sum, it basically reads as an ‘average’ or ‘mean’. In a sense, it is a scalar.
The average or mean has no ‘rate of change’. It is kind of just… fixed.
I am not sure, though, exactly which course you are referencing here, so I can’t provide more detail.
Thanks so much for your prompt reply, and happy new year!!
I’m referring to the vectorization of back-propagation.
While I follow this:
I struggle to understand how we discard the 1/m term in the expression for dZ = A - Y in the screenshot below
(with the additional step of dA = (A - Y) / ( A (1 - A) ) not shown)
If we are using the definition of J as the loss averaged over m examples (with the 1/m factor), as provided in my initial post, shouldn’t we have dZ = (1/m)(A - Y)?
Especially since the (1/m) term is later re-introduced when we compute dW and db.
The ‘1/m’ term is only relevant when you’re computing the gradient of the overall cost function J. When you compute the derivative with respect to the individual loss L, the 1/m is excluded because you’re not averaging over m samples at that point; you’re simply analyzing the contribution of a single training example.
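To make that concrete, here is a small numeric check of my own (not from the course notebooks), assuming the standard sigmoid activation with cross-entropy loss: the per-example gradient dL/dz works out to a - y with no 1/m factor, which you can confirm against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(a, y):
    # cross-entropy loss for a single training example
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y = 0.7, 1.0
a = sigmoid(z)

# analytic per-example gradient: dL/dz = a - y (note: no 1/m here)
analytic = a - y

# numerical gradient of L w.r.t. z via central differences
eps = 1e-6
numeric = (loss(sigmoid(z + eps), y) - loss(sigmoid(z - eps), y)) / (2 * eps)

print(analytic, numeric)  # the two agree to several decimal places

# The 1/m only shows up once you average over m examples to get the
# gradient of the cost J, e.g. dW = (1/m) * dZ @ A_prev.T
```

So dZ = A - Y is a per-example (loss) quantity stacked into a matrix, and the 1/m is applied exactly once, at the averaging step that produces dW and db.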
Again, as @Mushi says, in your first screenshot you are referring to the computation of the derivative per example. Indeed, when you average the derivatives, you need the 1/m factor (m being the number of examples).
@mushi has given the complete and precise answer above, but here’s another past thread that discusses the same point. The issue is that Prof Ng’s notation is ambiguous: you have to pay attention to whether the dSomething
value is a derivative of L or J.
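To spell out the notation issue, here is a sketch of the chain-rule step (my own summary of the standard sigmoid/cross-entropy derivation, consistent with the dA expression quoted above): the per-example derivative of L carries no 1/m, while the derivative of J does.

```latex
% Per-example: derivative of the loss L with respect to z
\frac{\partial L}{\partial a} = \frac{a - y}{a(1-a)}, \qquad
\frac{\partial a}{\partial z} = a(1-a)
\quad\Rightarrow\quad
\frac{\partial L}{\partial z}
  = \frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z}
  = a - y

% Over all m examples: the 1/m enters only through the cost J
J = \frac{1}{m}\sum_{i=1}^{m} L^{(i)}
\quad\Rightarrow\quad
dW = \frac{\partial J}{\partial W} = \frac{1}{m}\, dZ\, A_{\text{prev}}^{T}, \qquad
db = \frac{1}{m}\sum_{i=1}^{m} dz^{(i)}
```

So when you see dZ = A - Y, read dZ as a derivative of L (stacked column-wise); when you see the 1/m in dW and db, those are derivatives of J.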