The derivative of J is 1/m * (a^[L] - y), but instead of putting 1/m here, the formulas in the slides multiply every dW and db by 1/m.

Is there a reason for this? Placing the 1/m with the derivatives of the last layer works for the assignments and is mathematically correct, so it seems odd to write the formulas as they are in the slides.

More specifically, the formula for dAL in the first assignment of week 4 is technically wrong if dAL means "gradient of J with respect to AL"; it is only corrected after the fact in the "linear_backward" function. But if dAL were computed correctly to begin with, we wouldn't need to divide by m in "linear_backward".

I agree with you. If 1/m is included in the first dZ (i.e. dZ[L]), then dW, db and dA wouldn't need to include 1/m at all, and this is technically the right thing to do. Using a two-layer NN as an example, I think the correct version is dZ[2] = 1/m * (A[2] - Y), dW[2] = dZ[2] A[1].T, db[2] = np.sum(dZ[2], axis=1, keepdims=True), because 1/m has already been included in dZ[2]. I think the slides have some mistakes, as does the week 4 assignment. When I used the technically right formula instead of what was given, it produced an error and I couldn't pass the assignment. @albertovilla
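To make the comparison concrete, here is a runnable sketch of the convention described above: fold 1/m into dZ[2] once, so dW and db need no extra 1/m. The shapes, seed, and the sigmoid/ReLU activation choices are my own illustrative assumptions, not taken from the assignment, and both conventions produce the same final gradients:

```python
import numpy as np

# Illustrative sketch (assumed shapes/activations, not the assignment's code).
np.random.seed(0)
m, n_x, n_h, n_y = 5, 3, 4, 1
X = np.random.randn(n_x, m)
Y = (np.random.rand(n_y, m) > 0.5).astype(float)
W1, b1 = np.random.randn(n_h, n_x), np.zeros((n_h, 1))
W2, b2 = np.random.randn(n_y, n_h), np.zeros((n_y, 1))

# Forward pass: ReLU hidden layer, sigmoid output
Z1 = W1 @ X + b1
A1 = np.maximum(0, Z1)
Z2 = W2 @ A1 + b2
A2 = 1 / (1 + np.exp(-Z2))

# Backward pass with 1/m included once, in dZ2 (sigmoid + cross-entropy):
dZ2 = (A2 - Y) / m
dW2 = dZ2 @ A1.T                          # no extra 1/m here
db2 = np.sum(dZ2, axis=1, keepdims=True)  # no extra 1/m here
dZ1 = (W2.T @ dZ2) * (Z1 > 0)             # ReLU derivative as a mask
dW1 = dZ1 @ X.T
db1 = np.sum(dZ1, axis=1, keepdims=True)

# Prof Ng's convention: dZ2 without 1/m, then 1/m on each dW and db.
dZ2_ng = A2 - Y
dW2_ng = (dZ2_ng @ A1.T) / m
dZ1_ng = (W2.T @ dZ2_ng) * (Z1 > 0)
dW1_ng = (dZ1_ng @ X.T) / m
print(np.allclose(dW2, dW2_ng), np.allclose(dW1, dW1_ng))  # True True
```

Since dZ2 differs from dZ2_ng only by the constant factor 1/m, and everything downstream is linear in dZ2, the two bookkeeping conventions give identical gradients.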

Sorry, but this is not a typo. The reason you think that is that Prof Ng's notation is slightly ambiguous. You need to keep track of what the "numerator" is on the partial derivative term. Note that:

dA = \displaystyle \frac {\partial L}{\partial A}

But for dW and db the derivatives are of the scalar cost J:

dW = \displaystyle \frac {\partial J}{\partial W}

Of course J is the average of the vector quantity L over the samples, so thatâ€™s where the factor of \displaystyle \frac {1}{m} comes in.
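Spelling that out in the thread's notation (a small derivation, not from the original post):

J = \displaystyle \frac{1}{m} \sum_{i=1}^{m} L(a^{[L](i)}, y^{(i)})

so by linearity of the derivative:

\displaystyle \frac{\partial J}{\partial W} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial L(a^{[L](i)}, y^{(i)})}{\partial W}

which is exactly where the single factor of 1/m enters.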

The way Prof Ng structures everything here, it is only the "final" gradients that we actually are going to apply that are derivatives of J. All the rest are just Chain Rule factors. The only "final" gradients are those of W^{[l]} and b^{[l]}.

dz notation was said to denote "the derivative of the output w.r.t. z". The output in the case of m training samples is certainly J. The symbol L in the derivative \frac{\partial L}{\partial z} doesn't even make sense, because there are m different values of L: L(a^{(i)}, y^{(i)}) for all i = 1,\dots,m.

That's why I think when Prof Ng goes from one training example to m, and thus from L to J, the meaning of dz should change consistently, and this transition must be emphasized.
Awkward 1/m factors will then disappear, and what remains in the formulas for dW and db will be just the straightforward chain rule.

Sure, it makes sense. As you say, L is a vector quantity. You can take the gradient of a vector and it's a vector of the same shape. So you get a vector of m elements with the individual gradients. At least that's the way Prof Ng does things. Folks really familiar with vector and matrix calculus will raise their eyebrows at that (in pure math the gradient of a vector will be oriented as the transpose of the base object), but he is consistent in using that formulation and it keeps things a bit simpler.

If you include the factor of \frac {1}{m} in every dZ^{[l]}, then you get multiple factors in any layer but the last. Play out the Chain Rule implications for the second to last layer and you'll see what I mean. It would work if you include that factor only in dZ^{[L]} and then eliminate it everywhere else. That would be both mathematically correct and arguably simpler to write, but that's not how Prof Ng chooses to do it. My guess is that he avoids it because it makes dZ^{[L]} different from dZ^{[l]} for the hidden layers. With his method, the formula for dZ^{[l]} is general. Of course you are welcome to your opinion in preferring one method over the other, so when you're the professor teaching the class you can do it your preferred way.
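Here is a small numeric check of that chain-rule point: if 1/m appears in every dZ rather than only once, the earlier layers pick up extra factors. All the values below are made-up placeholders purely to exercise the algebra:

```python
import numpy as np

# Made-up placeholder values just to exercise the chain-rule bookkeeping.
np.random.seed(1)
m = 4
A2 = np.random.rand(1, m)    # pretend output activations
Y = np.random.rand(1, m)     # pretend labels
W2 = np.random.randn(1, 3)
Z1 = np.random.randn(3, m)   # pretend pre-activations (for the ReLU mask)
X = np.random.randn(2, m)

# Correct: 1/m appears only in dZ2; dZ1 inherits it through the chain rule.
dZ2 = (A2 - Y) / m
dZ1 = (W2.T @ dZ2) * (Z1 > 0)
dW1 = dZ1 @ X.T

# Incorrect: repeating 1/m in dZ1 yields 1/m**2 overall.
dZ1_bad = (W2.T @ dZ2) * (Z1 > 0) / m
dW1_bad = dZ1_bad @ X.T
print(np.allclose(dW1_bad, dW1 / m))  # True: off by an extra factor of m
```

The duplicated factor shrinks the hidden-layer gradients by m, which is exactly the kind of silent scaling bug the single-placement convention avoids.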

I'm not sure that I understand you, but if L takes a vector A and returns a vector, then its derivative will be a Jacobian matrix. So it still doesn't make sense. Possibly you mean that we need to take the diagonal of the matrix and put it into a vector, but it's not very obvious, to say the least.

I'm not proposing to include anything anywhere. I'm proposing to interpret dZ^{[l]} as described in the lectures (i.e. \frac{dJ}{dZ^{[l]}}). Then, indeed, 1/m will be only in the formula for the last dZ, but I don't see why that's a problem.

I can clarify my question a bit: "why don't we always interpret dh as df/dh for any output function f and any quantity h?" Currently, dW and db actually mean dJ/dW and dJ/db, but for some strange reason dZ means m \cdot dJ/dZ. I sincerely think that this change would make the materials clearer.

@paulinpaloalto proposed to view it as vector valued, if I understand correctly.
If we view it as scalar valued, then dL/dZ doesn't make sense, because there are m distinct values L(a^{(i)}, y^{(i)}).

Well, to be precise, there are two different functions:

The Loss (L) is a vector valued function, with one value per sample in the batch.

The Cost (J) is a scalar valued function which is the average of the Loss over the samples in the given batch.
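As a tiny illustration of that distinction, the Loss is a vector with one entry per sample while the Cost is its scalar mean; the numbers below are made up:

```python
import numpy as np

# Binary cross-entropy with made-up predictions and labels, m = 3 samples.
A = np.array([[0.9, 0.2, 0.7]])   # predictions, shape (1, m)
Y = np.array([[1.0, 0.0, 1.0]])   # labels
L = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))  # Loss: shape (1, 3), one value per sample
J = np.mean(L)                    # Cost: a single scalar, the mean over the batch
print(L.shape, round(float(J), 4))  # (1, 3) 0.2284
```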

Sometimes the wording can be a little sloppy and people use loss and cost interchangeably. But if you listen carefully to Prof Ng, that is the distinction he makes. And clearly any time he writes a formula, he will specifically use either J or L and the two are not interchangeable.

What Prof Ng is calling the "gradient" here is the vector of elementwise derivatives of the vector loss.

It sounds like you actually have the math background to deal with the derivations of all this. The course is specifically designed not to require even univariate calculus, so Prof Ng does not show the derivations of anything. Here's a thread with some links that show the derivations or at least point in the direction of that material.

Again, my point is not about making the derivation more technical.
It's about making it more consistent. It will not become more difficult to understand; quite the contrary, it will be less confusing.

I understand that you may be reluctant to make the proposed changes. But the question of their validity is orthogonal. Anyway, I enjoyed the lectures and think that they are quite neatly organized. Still, I believe they will benefit from this improvement.

I'm just a fellow student. The mentors do not work for DLAI and have no ability to modify anything. The best I could do is point the course staff to your comments, but trust me, this issue has been brought up plenty of times before.