Weeks 3 & 4: Why isn't 1/m part of dz^[L]?

The derivative of J with respect to Z^[L] is 1/m * (A^[L] - Y), but instead of putting the 1/m there, the formulas in the slides multiply every dW and db by 1/m.

Is there a reason for this? Placing the 1/m with the derivatives of the last layer works for the assignments and is mathematically correct, so it seems odd to write the formulas as they are in the slides.

8 Likes

I guess this is linked with the constant multiple rule:

The Constant multiple rule says the derivative of a constant multiplied by a function is the constant multiplied by the derivative of the function.

\displaystyle \frac {d}{dx}\left( c \cdot f(x) \right) = c \cdot \frac {d}{dx} f(x)

2 Likes

Sure, that’s why the formulas in the slides aren’t wrong, but why would you do that?

For example, if J = 3 * f(g(h(x))), we’re essentially caching f’(g) (instead of dJ/dg = 3 * f’(g)). Then, when we want to compute dJ/dh, we say

dJ/dh = 3 * f’(g) * g’(h)

and cache f’(g) * g’(h).

Then

dJ/dx = 3 * (cache) * h’(x).

At this point, the 3 has little to do with the current computation; it’s just correcting for the fact that the cache left it out.

Why not just cache 3 f’(g) initially?
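
For what it’s worth, here’s a tiny numeric check of the two conventions. The functions f, g, h are arbitrary picks of mine just to make it concrete, so treat them as placeholders:

```python
import numpy as np

# Toy chain J = 3 * f(g(h(x))) with f = sin, g = exp, h = square.
f, df = np.sin, np.cos
g, dg = np.exp, np.exp
h, dh = lambda x: x**2, lambda x: 2 * x

x = 0.7
hx, ghx = h(x), g(h(x))

# Slide-style convention: cache the "bare" chain-rule factor f'(g) and
# re-apply the constant 3 whenever a final gradient is needed.
cache = df(ghx)
dJ_dh_slides = 3 * cache * dg(hx)           # 3 re-applied here
dJ_dx_slides = 3 * cache * dg(hx) * dh(x)   # ...and here again

# Proposed convention: fold the 3 into the cache once, at the start.
cache2 = 3 * df(ghx)
dJ_dh_fold = cache2 * dg(hx)
dJ_dx_fold = cache2 * dg(hx) * dh(x)

print(np.isclose(dJ_dx_slides, dJ_dx_fold))  # True: both conventions agree
```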

2 Likes

More specifically, the formula for dAL in the first assignment of week 4 is technically wrong if dAL means “the gradient of J with respect to AL”; it’s only corrected after the fact in the “linear_backward” function. If dAL were computed correctly to begin with, we wouldn’t need to divide by m in “linear_backward”.
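
For concreteness, here is roughly what the two options look like. The dAL line is written from memory rather than copied from the notebook, so treat the exact expression as an assumption:

```python
import numpy as np

np.random.seed(0)
AL = np.random.uniform(0.01, 0.99, (1, 5))          # toy predictions
Y = np.random.randint(0, 2, (1, 5)).astype(float)   # toy labels
m = Y.shape[1]

# Roughly what the notebook computes: the derivative of the *loss* L w.r.t. AL,
# with no 1/m factor; linear_backward later divides dW and db by m.
dAL_loss = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

# If dAL meant dJ/dAL instead, the 1/m would live here and nowhere else.
dAL_cost = dAL_loss / m
```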

4 Likes

I agree with you. If 1/m is included in the first dZ (i.e. dZ[L]), then dW, db and dA wouldn’t need to include 1/m at all, and this is technically the right thing to do. Using a two-layer NN as an example, I think the correct version is dZ[2] = 1/m * (A[2] - Y), dW[2] = dZ[2] A[1].T, db[2] = np.sum(dZ[2], axis=1, keepdims=True), because 1/m has already been included in dZ[2]. I think the slides have some mistakes, and so does the week-4 assignment. When I used the technically correct formula instead of what was given, it produced an error and I couldn’t pass the assignment. @albertovilla
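
To double-check that numerically, here is a small toy example of my own (tanh hidden layer, sigmoid output, cross-entropy cost) showing that folding 1/m into dZ[2] gives the same dW[2] and db[2] as the slide formulas:

```python
import numpy as np

np.random.seed(1)
m = 8
X = np.random.randn(3, m)
Y = np.random.randint(0, 2, (1, m)).astype(float)
W1, b1 = np.random.randn(4, 3), np.zeros((4, 1))
W2, b2 = np.random.randn(1, 4), np.zeros((1, 1))

# Forward pass: tanh hidden layer, sigmoid output.
A1 = np.tanh(W1 @ X + b1)
A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))

# Slide convention: dZ2 carries no 1/m, so dW2 and db2 each include it.
dZ2 = A2 - Y
dW2 = (1 / m) * dZ2 @ A1.T
db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)

# Convention discussed here: fold 1/m into dZ2 once.
dZ2_alt = (1 / m) * (A2 - Y)
dW2_alt = dZ2_alt @ A1.T
db2_alt = np.sum(dZ2_alt, axis=1, keepdims=True)

print(np.allclose(dW2, dW2_alt), np.allclose(db2, db2_alt))  # True True
```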

3 Likes

I agree with you.

Nonetheless, the error you hit in assignment 4 must be a bug in your own code. Mine works fine.

Edit: Sorry, I thought you meant assignment 3 in week 3. I haven’t tried assignment 4, so I can’t tell.

3 Likes

I also totally agree with you.

I think this is probably a typo.

I reported the problem in the evaluation section of Week 4. Let’s wait and see if it gets improved.

3 Likes

Sorry, but this is not a typo. The reason you think that is that Prof Ng’s notation is slightly ambiguous. You need to keep track of what the “numerator” is on the partial derivative term. Note that:

dA = \displaystyle \frac {\partial L}{\partial A}

But for dW and db the derivatives are of the scalar cost J:

dW = \displaystyle \frac {\partial J}{\partial W}

Of course J is the average of the vector quantity L over the samples, so that’s where the factor of \displaystyle \frac {1}{m} comes in.

The way Prof Ng structures everything here, only the “final” gradients, the ones we actually apply in the update step, are derivatives of J. All the rest are just Chain Rule factors. The only “final” gradients are those of W^{[l]} and b^{[l]}.
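
Writing it out for one layer under that convention, dZ^{[l]} = \displaystyle \frac {\partial L}{\partial Z^{[l]}} is just a Chain Rule factor, while the “final” gradients pick up the \displaystyle \frac {1}{m} from the averaging:

dW^{[l]} = \displaystyle \frac {\partial J}{\partial W^{[l]}} = \frac {1}{m} dZ^{[l]} A^{[l-1]T} \qquad db^{[l]} = \displaystyle \frac {\partial J}{\partial b^{[l]}} = \frac {1}{m} \sum_{i=1}^{m} dZ^{[l](i)}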

Here’s another thread that discusses these issues.

3 Likes

Your interpretation is very confusing IMO.

The dz notation was said to denote “the derivative of the output w.r.t. z”. The output in the case of m training samples is certainly J. The symbol L in the derivative \frac{\partial L}{\partial z} doesn’t even make sense, because there are m different L’s: L(a^{(i)}, y^{(i)}) for i = 1,\dots,m.

That’s why I think that when Prof Ng goes from one training example to m, and thus from L to J, the meaning of dz should change consistently, and this transition should be emphasized.
The awkward 1/m factors will then disappear, and what remains in the formulas for dW and db will be just the straightforward chain rule.

1 Like

Sure, it makes sense. As you say, L is a vector quantity. You can take the gradient of a vector and it’s a vector of the same shape, so you get a vector of m elements with the individual gradients. At least that’s the way Prof Ng does things. Folks really familiar with vector and matrix calculus will raise their eyebrows at that (in pure math the gradient of a vector would be oriented as the transpose of the base object), but he is consistent in using that formulation and it keeps things a bit simpler.

If you include the factor of \frac {1}{m} in every dZ^{[l]}, then you get multiple factors in any layer but the last. Play out the Chain Rule implications for the second-to-last layer and you’ll see what I mean. It would work if you included that factor only in dZ^{[L]}, and then you could eliminate it everywhere else. That would be both mathematically correct and arguably simpler to write, but that’s not how Prof Ng chooses to do it. My guess is that he prefers not to do it that way because it makes dZ^{[L]} different from dZ^{[l]} for the hidden layers. With his method, the formula for dZ^{[l]} is general. Of course you are welcome to your opinion in preferring one method over the other, so when you’re the professor teaching the class you can do it your preferred way. :grin:
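
To spell out the Chain Rule point: with the recursion from the slides, dZ^{[l]} = W^{[l+1]T} dZ^{[l+1]} * g^{[l]\prime}(Z^{[l]}), any \displaystyle \frac {1}{m} placed in dZ^{[L]} automatically propagates into every earlier dZ^{[l]}. So if you then also kept the \displaystyle \frac {1}{m} in dW^{[l]} = \frac {1}{m} dZ^{[l]} A^{[l-1]T}, the hidden layers would end up with \frac {1}{m^2}; and putting an explicit \frac {1}{m} into every dZ^{[l]} compounds the factor even further. Putting it once in dZ^{[L]} and dropping it from all the dW and db formulas is the consistent alternative.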

1 Like

I’m not sure that I understand you, but if L takes a vector A and returns a vector, then its derivative will be a Jacobian matrix. So it still doesn’t make sense. Possibly you mean that we need to take the diagonal of that matrix and put it into a vector, but that’s not very obvious, to say the least.

1 Like

I’m not proposing to include anything anywhere. I’m proposing to interpret dZ^{[l]} as described in the lectures (i.e. \frac{dJ}{dZ^{[l]}}). Then, indeed, 1/m will appear only in the formula for the last dZ, but I don’t see why that’s a problem.

1 Like

The loss function is scalar valued.

1 Like

I can clarify my question a bit: “why don’t we always interpret dh as df/dh, for any output function f and any quantity h?” Currently, dW and db actually mean dJ/dW and dJ/db, but for some strange reason dZ means m \cdot dJ/dZ. I sincerely think that this change would make the materials clearer.
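
In other words, as the formulas are written, dZ^{[l]} = \displaystyle \frac {\partial L}{\partial Z^{[l]}} = m \cdot \frac {\partial J}{\partial Z^{[l]}}, while dW^{[l]} and db^{[l]} are already derivatives of J.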

2 Likes

@paulinpaloalto proposed to view it as vector-valued, if I understand correctly.
If we view it as scalar-valued, then dL/dZ doesn’t make sense, because there are m distinct values L(a^{(i)}, y^{(i)}).

1 Like

Well, to be precise, there are two different functions:

The Loss (L) is a vector valued function, with one value per sample in the batch.

The Cost (J) is a scalar valued function which is the average of the Loss over the samples in the given batch.
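
In symbols: J = \displaystyle \frac {1}{m} \sum_{i=1}^{m} L(a^{(i)}, y^{(i)}).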

Sometimes the wording can be a little sloppy and people use loss and cost interchangeably. But if you listen carefully to Prof Ng, that is the distinction he makes. And clearly any time he writes a formula, he will specifically use either J or L and the two are not interchangeable.

1 Like

What Prof Ng is calling the “gradient” here is the vector of elementwise derivatives of the vector loss.

It sounds like you actually have the math background to deal with the derivations of all this. The course is specifically designed not to require even univariate calculus, so Prof Ng does not show the derivations of anything. Here’s a thread with some links that show the derivations or at least point in the direction of that material.

1 Like

Again, my point is not about making the derivation more technical.
It’s about making it more consistent. It would not become more difficult to understand; quite the contrary, it would be less confusing.

I understand that you may be reluctant to make the proposed changes, but the question of their validity is orthogonal to that. Anyway, I enjoyed the lectures and think they are quite neatly organized. Still, I believe they would benefit from this improvement.

1 Like

I’m just a fellow student. The mentors do not work for DLAI and have no ability to modify anything. The best I can do is point the course staff to your comments, but trust me, this issue has been brought up plenty of times before.

If the course authors are aware of the problem and decided to leave it as it is, my mission is complete :slight_smile:

Thank you for the discussion, anyway.

1 Like