In the notes, backpropagation over the entire training set of m examples is given as (L = index of the output layer):

dz[L] = a[L] - y

dW[L] = 1/m * dz[L].a[L-1].T

db[L] = 1/m * sum(dz[L], …)

etc…

I worked out the backpropagation manually, and I obtained the following:

dz[L] = 1/m * (a[L] - y)

dW[L] = dz[L].a[L-1].T

db[L] = sum(dz[L], …)

etc…

The difference from the notes is that the ‘1/m’ factor is applied once, at the very start of the backpropagation step, and then propagates naturally into the following equations, whereas the notes re-apply it in every dW and db.

I tried the second version of the equations on the Week 3 Python submission: all tests passed and I obtained the same results.
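A minimal NumPy sketch of that comparison, assuming a two-layer network with a tanh hidden layer and a sigmoid output trained with cross-entropy (the layer sizes, random data, and the `backward` helper are made up for illustration and are not the assignment code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, m = 3, 4, 5                      # hypothetical layer sizes and sample count
X = rng.standard_normal((n_x, m))
Y = rng.integers(0, 2, size=(1, m)).astype(float)
W1 = rng.standard_normal((n_h, n_x)) * 0.5
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((1, n_h)) * 0.5
b2 = np.zeros((1, 1))

# Forward pass: tanh hidden layer, sigmoid output
A1 = np.tanh(W1 @ X + b1)
A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))

def backward(scale_dz, scale_grads):
    """Backprop where 1/m can be placed in dZ[L] or in each dW/db."""
    dZ2 = scale_dz * (A2 - Y)
    dW2 = scale_grads * (dZ2 @ A1.T)
    db2 = scale_grads * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)   # tanh'(z) = 1 - tanh(z)^2
    dW1 = scale_grads * (dZ1 @ X.T)
    db1 = scale_grads * np.sum(dZ1, axis=1, keepdims=True)
    return dW1, db1, dW2, db2

notes = backward(scale_dz=1.0, scale_grads=1.0 / m)    # 1/m in every dW and db
manual = backward(scale_dz=1.0 / m, scale_grads=1.0)   # 1/m once, in dZ[L]

same = all(np.allclose(g_n, g_m) for g_n, g_m in zip(notes, manual))
print("gradients agree:", same)
```

Both placements give the same dW and db because every later step is linear in dZ[L]; only the intermediate dZ values differ by the factor 1/m.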

I thus assume the backpropagation as written above is also correct. Am I right? In that case, why do the notes prefer dz[L] = a[L] - y?

Is the difference down to the way dL/da is defined for a training set of m examples?

(1) Either stack the per-sample derivatives dL(a(i), y(i))/dz(i) into a row vector of length m, in which the ‘1/m’ term does not appear, and add the 1/m term at the next backpropagation step when computing dW, as is done in the notes;

(2) Or calculate dL(a,y)/dz directly, in which case ‘1/m’ appears right away.

Remember that L is the vector output of the loss function and the cost J is the average of the L values across the samples. So it’s just a question of how you define things. Prof Ng’s notation is slightly ambiguous, but the key thing to realize is that only the dW and db expressions are partial derivatives of J, so they are the only ones that include the factor of \frac {1}{m}. All other terms are either derivatives of L or just “Chain Rule” factors at a given layer. You have to be careful that you don’t end up with multiple factors of \frac {1}{m}, right?

Thank you a lot for your reply!
I forgot that L is defined as a vector; I only had in mind L(a(i), y(i)) as a single number.
It makes total sense now, thank you!

Using the fact that the derivative of the average is the average of the derivatives, right? Think about it for a sec and that should make sense, since taking derivatives is a linear operation. So you get the \frac {1}{m} from the last step in the Chain Rule.
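Written out for the output layer, as a sketch assuming the sigmoid plus cross-entropy setup that gives dz[L] = a[L] - y in the notes. The symbol \mathcal{L} for the per-sample loss and the sample superscript (i) are added here only to avoid a clash with the layer index L:

$$
J = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}^{(i)}, \qquad \mathcal{L}^{(i)} = \mathcal{L}\left(a^{[L](i)}, y^{(i)}\right)
$$

$$
\frac{\partial J}{\partial W^{[L]}}
= \frac{1}{m} \sum_{i=1}^{m} \frac{\partial \mathcal{L}^{(i)}}{\partial W^{[L]}}
= \frac{1}{m} \sum_{i=1}^{m} \frac{\partial \mathcal{L}^{(i)}}{\partial z^{[L](i)}} \, a^{[L-1](i)T}
= \frac{1}{m} \, dZ^{[L]} A^{[L-1]T}
$$

Here column i of dZ^{[L]} is \partial \mathcal{L}^{(i)} / \partial z^{[L](i)} = a^{[L](i)} - y^{(i)}, which carries no factor of \frac {1}{m}; the factor enters exactly once, through the average that defines J.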

I’m still having a hard time understanding your explanations here. For the dZ2 notation, does that mean dL/dZ2, and not dJ/dZ2? If so, could you clarify why we take the derivative of L rather than J with respect to Z2? Thank you!

Yes, all the dZ values are Chain Rule factors at a given layer. It is literally only the dW^{[l]} and db^{[l]} where all the Chain Rule factors get multiplied together to form the full gradients w.r.t. J.

The partial derivatives you worked out are for a network with forward propagation Z[l] = W[l] * a[l-1] + b[l] and a cost function like J = 1/m * Sum(A[L]).

Andrew’s equations are actually for a network like the following: Z[i] = 1/m * W[i] * a[i-1] + 1/m * b[i], with a cost function like J = Sum(A[L]).

So, Andrew’s equations are mathematically incorrect for the way he designed the network.

I don’t see any advantage in writing the equations his way.

Sorry, but I disagree. What you are missing is that dW^{[l]} = \displaystyle \frac {\partial J}{\partial W^{[l]}}. There is no factor of \frac {1}{m} needed at the level of the individual layer forward prop functions in order to get the derivatives that Prof Ng shows. The \frac {1}{m} comes from the last Chain Rule step of computing the cost J as the average of L across the samples.

To see this point more clearly, think about how the Chain Rule works: it builds up more factors at each level. With your formulation, you’d end up with multiple factors of \frac {1}{m}.
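One way to check this numerically is a finite-difference gradient check. The sketch below assumes a single sigmoid layer with cross-entropy loss (the data, sizes, and the `cost` helper are invented for illustration), with no \frac {1}{m} anywhere in forward prop and J defined as the mean of the per-sample losses:

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, m = 3, 6                               # hypothetical sizes
X = rng.standard_normal((n_x, m))
Y = rng.integers(0, 2, size=(1, m)).astype(float)
W = rng.standard_normal((1, n_x))
b = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W, b):
    # Standard forward prop (no 1/m inside Z); J is the mean of the losses
    A = sigmoid(W @ X + b)
    losses = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return float(np.mean(losses))

# Analytic gradients in the form used in the notes
A = sigmoid(W @ X + b)
dZ = A - Y                                  # per-sample derivatives of the loss, no 1/m
dW = (dZ @ X.T) / m
db = np.sum(dZ, axis=1, keepdims=True) / m

# Numerical gradients of J by central differences
eps = 1e-6
dW_num = np.zeros_like(W)
for j in range(W.shape[1]):
    step = np.zeros_like(W)
    step[0, j] = eps
    dW_num[0, j] = (cost(W + step, b) - cost(W - step, b)) / (2 * eps)
db_num = (cost(W, b + eps) - cost(W, b - eps)) / (2 * eps)

matches = bool(np.allclose(dW, dW_num, atol=1e-7)
               and np.isclose(db.item(), db_num, atol=1e-7))

# Putting 1/m in dZ *and* in dW gives a gradient that is too small by a factor of m
dW_double = ((dZ / m) @ X.T) / m
double_is_off_by_m = bool(np.allclose(dW_double * m, dW_num, atol=1e-7))

print("notes' formulas match numerical gradient:", matches)
print("double 1/m is off by a factor of m:", double_is_off_by_m)
```

This only covers the output layer of a one-layer model, but the same check can be repeated layer by layer on a deeper network.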

What we are discussing is whether Andrew’s equation dW[L] = 1/m*dz[L].a[L-1].T is correct. If you look at one of the reading materials under **Week 4 Optional Reading: Feedforward Neural Networks in Depth**

The derivation in the “I Deep Learning” material is incomplete, because it does not specify any particular loss function. It is focused only on specifics of individual layers, with no view of the total network or what task you are trying to solve.

You cannot implement such a network.

Nor can you compare it to the material Andrew presents.

Jonas does things a bit differently than Andrew does. Notice that he keeps the RHS as derivatives of J, so the \frac {1}{m} is still buried in the \displaystyle \frac {\partial J}{\partial Z^{[l]}}.

Andrew’s formula is correct because his dZ on the RHS is a derivative of L, not of J. As commented above (I think) and definitely on many other threads, Andrew’s notation is ambiguous and you just have to keep track of the context when he says “d” of something, whether the “numerator” is L or J or yet something else.
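In matrix shorthand, the two conventions can be related as follows. This is a sketch, not quoted from either source, and it assumes dZ^{[l]} denotes Andrew's per-sample derivative of the loss:

$$
\frac{\partial J}{\partial Z^{[l]}} = \frac{1}{m} \, dZ^{[l]}
\quad \Longrightarrow \quad
\frac{\partial J}{\partial W^{[l]}}
= \frac{\partial J}{\partial Z^{[l]}} \, A^{[l-1]T}
= \frac{1}{m} \, dZ^{[l]} A^{[l-1]T}
$$

So the same gradient comes out either way: with derivatives of J on the RHS the \frac {1}{m} is implicit, and with Andrew's dZ it has to be written explicitly.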

Hint: the only time Andrew means derivatives of J is dW and db. Everything else is just a Chain Rule factor that goes into the final computation of dW and db.

I think we are talking at cross purposes here. Prof Ng’s equations are all consistent. The J cost does include the factor of \frac {1}{m} because it is the average of the L values over the m samples. You don’t need to further divide J by m because the factor is already included.

Anyone who thinks they are inconsistent is missing the point I made about needing to be careful about what the “numerator” is in any of the gradients.

The dW and db values include the factor of \frac {1}{m} because they are derivatives of J, which also has the factor. Everything else is not a derivative of J, so does not include the factor.