In the notes, backpropagation over the entire training set of m examples is given as follows (L = index of the output layer):

dz[L] =a[L]-y

dW[L] = 1/m*dz[L].a[L-1].T

db[L] = 1/m*sum(dz[L], …)

etc…

I worked out the backpropagation manually, and I obtained the following:

dz[L] =1/m(a[L]-y)

dW[L] = dz[L].a[L-1].T

db[L] = sum(dz[L], …)

etc…

The difference from the notes is that the ‘1/m’ term is applied once at the very start of the backpropagation step and then propagates naturally through the subsequent equations, whereas in the notes it is re-added every time.

I tried the 2nd version of the equations on the Week3 Python submission, and all tests passed and I obtained the same results.
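To make the comparison concrete, here is a minimal NumPy sketch of both conventions on a toy two-layer network. It is not the assignment code: the tanh hidden layer, sigmoid output, the `axis=1, keepdims=True` arguments to the sum, and all function names and array shapes are my own assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    # Assumed architecture: tanh hidden layer, sigmoid output layer.
    A1 = np.tanh(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    return A1, A2

def backward_notes(X, Y, A1, A2, W2):
    # Convention from the notes: dZ has no 1/m, each dW and db carries it.
    m = X.shape[1]
    dZ2 = A2 - Y
    dW2 = (dZ2 @ A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)  # tanh'(z) = 1 - tanh(z)^2
    dW1 = (dZ1 @ X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

def backward_scaled(X, Y, A1, A2, W2):
    # Alternative convention: 1/m is applied once, in dZ of the output layer.
    m = X.shape[1]
    dZ2 = (A2 - Y) / m
    dW2 = dZ2 @ A1.T
    db2 = np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
    dW1 = dZ1 @ X.T
    db1 = np.sum(dZ1, axis=1, keepdims=True)
    return dW1, db1, dW2, db2

# Hypothetical toy data: 3 features, 4 hidden units, 5 examples.
rng = np.random.default_rng(0)
n_x, n_h, m = 3, 4, 5
X = rng.standard_normal((n_x, m))
Y = (rng.random((1, m)) > 0.5).astype(float)
W1 = rng.standard_normal((n_h, n_x)) * 0.5
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((1, n_h)) * 0.5
b2 = np.zeros((1, 1))

A1, A2 = forward(X, W1, b1, W2, b2)
grads_notes = backward_notes(X, Y, A1, A2, W2)
grads_scaled = backward_scaled(X, Y, A1, A2, W2)

# The final gradients agree up to floating-point rounding.
same = all(np.allclose(g1, g2) for g1, g2 in zip(grads_notes, grads_scaled))
print(same)
```

Both functions return the same dW and db values; only the intermediate dZ arrays differ, by a factor of m.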

I thus assume the backpropagation written as above is also correct. Am I right? If so, why do the notes prefer dz[L] = a[L]-y?

Is the difference down to the way dL/da is defined for a training set of size m?

(1) Either stack the per-example derivatives dL(a(i), y(i))/dz(i), in which the ‘1/m’ term does not appear, into a row vector of length m, and then add the 1/m term at the next backpropagation step when computing dW, as is done in the notes.

(2) Or directly calculate dL(a,y)/dz, in which case ‘1/m’ appears directly.

Remember that L is the vector output of the loss function and the cost J is the average of the L values across the samples. So it’s just a question of how you define things. Prof Ng’s notation is slightly ambiguous, but the key thing to realize is that only the dW and db expressions are partial derivatives of J, so they are the only ones that include the factor of \frac {1}{m}. All other terms are either derivatives of L or just “Chain Rule” factors at a given layer. You have to be careful that you don’t end up with multiple factors of \frac {1}{m}, right?

Thank you a lot for your reply!
I forgot that L is defined as a vector; I only had L(a(i), y(i)) in mind as a single number.
It makes total sense now, thank you!

Using the fact that the derivative of the average is the average of the derivatives, right? Think about it for a sec and that should make sense, since taking derivatives is a linear operation. So you get the \frac {1}{m} from the last step in the Chain Rule.
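Written out for the output layer, the linearity argument looks roughly like this. I am writing \mathcal{L}^{(i)} for the loss on example i (to avoid a clash with the layer index L) and assuming a sigmoid output with cross-entropy loss, so that the per-example factor is a^{[L](i)} - y^{(i)}:

$$
J = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}^{(i)}
\quad\Longrightarrow\quad
\frac{\partial J}{\partial W^{[L]}}
= \frac{1}{m}\sum_{i=1}^{m} \frac{\partial \mathcal{L}^{(i)}}{\partial W^{[L]}}
= \frac{1}{m}\sum_{i=1}^{m} \left(a^{[L](i)} - y^{(i)}\right) a^{[L-1](i)T}
= \frac{1}{m}\, dZ^{[L]} A^{[L-1]T}
$$

The columns of dZ^{[L]} are the per-example derivatives, and the single \frac {1}{m} comes from averaging them in the last step.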

I’m still having a hard time understanding your explanations here. For the dZ2 notation, does that mean dL/dZ2, and not dJ/dZ2? If so, could you clarify why we are taking the derivative of L rather than J with respect to Z2? Thank you!

Yes, all the dZ values are Chain Rule factors at a given layer. It is literally only the dW^{[l]} and db^{[l]} where all the Chain Rule factors get multiplied together to form the full gradients w.r.t. J.
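If it helps, this can be checked numerically. The sketch below uses a single sigmoid layer with cross-entropy loss (my simplification, not the course code; all names are hypothetical) and compares the analytic expressions against central finite differences of J.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_from_z(Z, Y):
    # J = average over the m examples of the per-example cross-entropy loss.
    A = sigmoid(Z)
    losses = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return float(np.mean(losses))

def numerical_grad(f, x, eps=1e-6):
    # Central finite differences, one entry of x at a time (x is modified
    # temporarily and restored).
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        orig = x[idx]
        x[idx] = orig + eps
        f_plus = f(x)
        x[idx] = orig - eps
        f_minus = f(x)
        x[idx] = orig
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

# Hypothetical toy data: 3 features, 4 examples.
rng = np.random.default_rng(1)
n_x, m = 3, 4
X = rng.standard_normal((n_x, m))
Y = np.array([[1.0, 0.0, 1.0, 0.0]])
W = rng.standard_normal((1, n_x)) * 0.5
b = np.zeros((1, 1))

Z = W @ X + b
A = sigmoid(Z)
dZ = A - Y              # notes convention: no 1/m here
dW = (dZ @ X.T) / m     # notes convention: 1/m appears here

dJ_dW = numerical_grad(lambda w: cost_from_z(w @ X + b, Y), W.copy())
dJ_dZ = numerical_grad(lambda z: cost_from_z(z, Y), Z.copy())

dW_matches_J = np.allclose(dW, dJ_dW, atol=1e-6)
dZ_is_m_times_dJ_dZ = np.allclose(dZ, m * dJ_dZ, atol=1e-6)
print(dW_matches_J, dZ_is_m_times_dJ_dZ)
```

In this setup dW agrees with the numerical gradient of J, while dZ is m times the numerical dJ/dZ, which is consistent with dZ holding per-example derivatives of the loss rather than derivatives of the cost.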