dA derivation; where does the 1/m term go?

@marieak, @paulinpaloalto is better at explaining this, but a really easy (at least in my mind) way to think about it is: if you see 1/m or 1/n (or something similar) in front of a sum, it basically reads as an ‘average’ or ‘mean’. In a sense, it is just a scalar.

A constant scalar like that has no ‘rate of change’ of its own. It is kind of just… fixed, so it passes straight through the derivative.

I am not sure exactly which course you are referencing here, though, so I can’t provide more detail.


Thanks so much for your prompt reply, and happy new year!!

I’m referring to the vectorization of back-propagation.

While I follow this:

I struggle to understand how we discard the 1/m term in the expression for dZ = A - Y in the screenshot below
(with the additional step of dA = (A - Y) / ( A (1 - A) ) not shown)

If we are using the definition of J averaged over m examples (i.e., with the 1/m factor), as given in my initial post, shouldn’t we have dZ = (1/m)(A - Y)?

Especially since the (1/m) term is later re-introduced when we compute dW and db.

The ‘1/m’ term is only relevant when you’re computing the gradient of the overall cost function J. When you compute the derivative with respect to the individual loss L, the 1/m is excluded because you’re not averaging over m samples at that point; you’re simply analyzing the contribution of a single training example.
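To make the cancellation explicit, here is the single-example chain rule written out (a sketch of the standard derivation, not the exact course slides), with a = σ(z) and the cross-entropy loss:

```latex
% Single example: L = -(y \log a + (1-y)\log(1-a)), with a = \sigma(z).
\frac{dL}{da} = \frac{a - y}{a(1-a)}, \qquad
\frac{da}{dz} = a(1-a),
\qquad\text{so}\qquad
\frac{dL}{dz} = \frac{dL}{da}\cdot\frac{da}{dz} = a - y.

% The 1/m enters only when we differentiate the averaged cost:
J = \frac{1}{m}\sum_{i=1}^{m} L^{(i)}
\quad\Longrightarrow\quad
\frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}\frac{\partial L^{(i)}}{\partial w}.
```

Note that a(1-a) cancels exactly, which is why dz = a - y has no 1/m: that factor only appears once you average the per-example gradients to get the gradient of J.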


Again, as @Mushi says, in your first screenshot you are referring to the computation of the derivative per example. Indeed, when you average the derivatives you need the 1/m factor (where m is the number of examples).
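A quick numerical way to convince yourself of this (my own sketch, not course-provided code): check by finite differences that the per-example derivative dL/dz equals a - y with no 1/m, while the derivative of the averaged cost J picks up the 1/m.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(z, y):
    # Per-example cross-entropy loss L, with a = sigmoid(z)
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y, eps = 0.3, 1.0, 1e-6

# Finite-difference derivative of the per-example loss L w.r.t. z
numeric = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)
analytic = sigmoid(z) - y          # dL/dz = a - y, no 1/m here
assert np.isclose(numeric, analytic, atol=1e-6)

# For the averaged cost J = (1/m) * sum(L_i), the same check picks up 1/m
rng = np.random.default_rng(0)
m = 5
zs = rng.normal(size=m)
ys = rng.integers(0, 2, size=m).astype(float)

def cost(zs):
    return loss(zs, ys).mean()     # J = (1/m) * sum of per-example losses

grad = np.zeros(m)
for i in range(m):
    zp, zm = zs.copy(), zs.copy()
    zp[i] += eps
    zm[i] -= eps
    grad[i] = (cost(zp) - cost(zm)) / (2 * eps)

# dJ/dz_i = (1/m)(a_i - y_i): the 1/m shows up only at the cost level
assert np.allclose(grad, (sigmoid(zs) - ys) / m, atol=1e-6)
print("checks passed")
```

So `dZ = A - Y` is correct at the loss level, and the 1/m re-enters exactly where you observed it: in the averaged quantities dW and db.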


Okay, got it, thank you!!

I think this is where my confusion stemmed from:

Cheers


@mushi has given the complete and precise answer above, but here’s another past thread that discusses the same point. The issue is that Prof Ng’s notation is ambiguous: you have to pay attention to whether a given dSomething value is a derivative of L or of J.
