Typo in backprop formulas (week 3 and week 4)

I think many of the backprop formulas given in the week 3 and week 4 assignments, as well as the clarification post in week 4, are off by a factor of 1/m. That said, the typos cancel each other out: the dW's and db's, which are what we ultimately care about, turn out to be correct, while the dZ's and dA's are incorrect. (My background: I used to teach college calculus classes.)

For example, line 2 in the clarification post writes (forgive me, I don’t know how to use LaTeX here):

dZ^[L] = A^[L] - Y

Computing the partial derivative of the loss function

-1/m * \sum_i [ y^(i) log(a^(i)) + (1 - y^(i)) log(1 - a^(i)) ]

with respect to a^(i) will give you:

-1/m * [ y^(i)/a^(i) - (1 - y^(i))/(1 - a^(i)) ],

while da^(i)/dz^(i) = a^(i) * (1 - a^(i)),

since the activation function in the last layer is the sigmoid function.

We then use the chain rule to get

d(loss)/dz^(i) = 1/m * (a^(i) - y^(i)).

We get dZ^[L] just by stacking the above terms together into a (1, m) row vector, which should be 1/m * (A^[L] - Y).
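Here is a quick numerical sanity check of that 1/m factor (just a numpy sketch with made-up sizes, comparing the analytic derivative to finite differences):

```python
import numpy as np

# Quick check that d(cost)/dz^(i) = 1/m * (a^(i) - y^(i)) for a sigmoid
# output with the cross-entropy cost. Sizes and seed are just for illustration.
rng = np.random.default_rng(0)
m = 5
z = rng.normal(size=(1, m))                        # last-layer pre-activations
y = rng.integers(0, 2, size=(1, m)).astype(float)  # binary labels

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cost(z):
    a = sigmoid(z)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

analytic = (sigmoid(z) - y) / m     # the chain-rule result above

# Central finite differences, one component of z at a time
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(m):
    zp, zm = z.copy(), z.copy()
    zp[0, i] += eps
    zm[0, i] -= eps
    numeric[0, i] = (cost(zp) - cost(zm)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # tiny (~1e-10): the 1/m is really there
```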

Note that the 1/m terms in the formulas for the dW's and db's should then be removed; there is no 1/m term in the formula

Z = WA + b,

so you shouldn’t see those 1/m terms here.

The formulas are correct. The problem is that Prof Ng’s notation is a bit ambiguous. The key point you need to keep track of is whether the given quantity is a partial derivative of the cost J (which is the average of the loss), a partial derivative of the loss L (which is a vector-valued function, not an average), or just a Chain Rule factor at the given layer. If you taught calculus at the college level, then with the information I just gave you, you should be able to figure this out. (I was also a math graduate student, so I taught calculus to undergraduates back in the day as well. But it was a looong time ago. :nerd_face:)

For example:

dZ = \displaystyle \frac {\partial L}{\partial Z}

so there is no factor of \frac {1}{m}, because L is a vector-valued function, not an average of anything. On the other hand:

dW^{[l]} = \displaystyle \frac {\partial J}{\partial W^{[l]}}

The terms in which you see the \frac {1}{m} are only dW^{[l]} and db^{[l]}, because those are the only quantities that are partial derivatives with respect to J.
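Putting the whole convention in one place (nothing new, just collecting it):

dA^{[l]} = \displaystyle \frac {\partial L}{\partial A^{[l]}}, \quad dZ^{[l]} = \frac {\partial L}{\partial Z^{[l]}}, \quad dW^{[l]} = \frac {\partial J}{\partial W^{[l]}}, \quad db^{[l]} = \frac {\partial J}{\partial b^{[l]}}, \quad \text{where } J = \frac {1}{m} \sum_{i=1}^{m} L^{(i)}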

Thanks for your quick reply and for pointing out the difference between J and L, which I later saw in assignment 4.1.

It seems a bit unnecessary to involve both L and J in the partial derivative computation. It would have been much cleaner to stick with J only. (Am I missing something here?) After all, we are applying the chain rule to compute all these derivatives. For example, to compute dJ/dW^[l], we should first compute dJ/dZ^[l], which we get from backprop, then compute dZ^[l]/dW^[l], and then matrix-multiply them. Instead, we have to rescale every time, since what we have computed was dL/dZ^[l] instead of dJ/dZ^[l].

Regarding the first displayed equation you wrote: by dL/dZ, you probably meant stacking together the scalars dl^(i)/dz^(i), where l^(i) is the loss of the i-th example. (Here, I assume Z is of shape (1, m).) Indeed, the left-hand side of the equation, dZ, should have the same shape as (or, depending on notation, the transpose of) Z. On the other hand, both L and Z are m-dimensional, so dL/dZ should be an m-by-m matrix, with the only nonzero elements, dl^(i)/dz^(i), along the diagonal. We don’t want that, since then the dZ term in the computation of dW and db would mess up the dimensions.

By the way, how did you write LaTeX here?

You can interpolate LaTeX by bracketing the expression with single dollar signs. This is covered in the FAQ Thread.
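For example, typing \frac {1}{m} between single dollar signs produces the rendered fraction you see in the posts above.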

W.r.t. your larger points, I suggest you think more carefully about what we are actually doing here. Prof Ng is showing you how to break down the computation into multiple steps. We are using the Chain Rule everywhere, but remember that J is the very last step, right? Think about what happens in the computation for a layer other than the last. How about the very first hidden layer? How many other layers do you have to go through in order to get to J? We can only include the factor of \frac {1}{m} once for each dW^{[l]} and db^{[l]} value, right? Otherwise we end up with \frac {1}{m^n}, where n is the number of subsequent layers.

OK, I must be making some kind of blunder here.

What I proposed is to define all the partial derivatives with respect to J, instead of with respect to what you called L, which is an m-dimensional vector, when we are differentiating with respect to the Z's. The point is that \frac{\partial L}{\partial Z} has dimension m \times \dim Z, which messes things up when we use it in the calculations of dW and db. It would also make things a lot more consistent.
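Written out explicitly (with Z of shape (1, m), as in the earlier example), that Jacobian would be the diagonal matrix

\frac{\partial L}{\partial Z} = \begin{pmatrix} \frac{\partial l^{(1)}}{\partial z^{(1)}} & & \\ & \ddots & \\ & & \frac{\partial l^{(m)}}{\partial z^{(m)}} \end{pmatrix},

and what we actually want to carry through backprop is just the row vector of its diagonal entries.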

In this case,

\frac{\partial J}{\partial Z^{[L]}}= \frac{1}{m}(A^{[L]}-Y),
\vdots
\frac{\partial J}{\partial Z^{[l]}} = W^{[l+1]T}\frac{\partial J}{\partial Z^{[l+1]}} * g^{[l]'}(Z^{[l]})
\vdots
\frac{\partial J}{\partial W^{[l]}} = \frac{\partial J}{\partial Z^{[l]}} A^{[l-1]T}.

(no 1/m in the last line.)

In fact, I used this and passed assignment 3, where we are not graded on the values of the dZ's. What I wrote still gives the correct values of the dW's and db's.
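As a concrete check, here is a small numpy sketch (the sizes and the ReLU hidden layer are just for illustration) comparing the two formulations on a tiny two-layer network; they produce identical dW's and db's:

```python
import numpy as np

# Tiny two-layer network (ReLU hidden layer, sigmoid output) comparing the
# course formulation (1/m inside dW, db) with the J-only formulation
# (1/m introduced once in dZ at the output). Names/sizes are illustrative.
rng = np.random.default_rng(1)
n_x, n_1, m = 3, 4, 7
X = rng.normal(size=(n_x, m))
Y = rng.integers(0, 2, size=(1, m)).astype(float)
W1, b1 = rng.normal(size=(n_1, n_x)), np.zeros((n_1, 1))
W2, b2 = rng.normal(size=(1, n_1)), np.zeros((1, 1))

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Forward pass
Z1 = W1 @ X + b1
A1 = np.maximum(0, Z1)               # ReLU
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)

# Course version: dZ's are derivatives of L, 1/m added inside dW and db
dZ2 = A2 - Y
dW2_course = (1 / m) * dZ2 @ A1.T
db2_course = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
dZ1 = (W2.T @ dZ2) * (Z1 > 0)
dW1_course = (1 / m) * dZ1 @ X.T

# J-only version: dZ's are derivatives of J, 1/m appears once at the output
dZ2_J = (A2 - Y) / m
dW2_J = dZ2_J @ A1.T
db2_J = np.sum(dZ2_J, axis=1, keepdims=True)
dZ1_J = (W2.T @ dZ2_J) * (Z1 > 0)
dW1_J = dZ1_J @ X.T

print(np.allclose(dW2_course, dW2_J),
      np.allclose(db2_course, db2_J),
      np.allclose(dW1_course, dW1_J))   # True True True
```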


There can be multiple correct ways to write these formulas. Please realize that dZ^{[l]} being a vector does not “mess up” anything. You have to understand the notational conventions that Prof Ng uses:

When he means “elementwise” multiplication, he always indicates that by using “*” as the operator.

When he means normal matrix multiplication, he simply writes the operands adjacent to each other with no explicit operator.

So consider the formulas for computing dW^{[l]} as Prof Ng writes them:

dZ^{[l]} = dA^{[l]} * g^{[l]'}(Z^{[l]})
dW^{[l]} = \displaystyle \frac {1}{m} dZ^{[l]}A^{[l-1]T}

I would prefer to write that last one this way to make the operation explicit:

dW^{[l]} = \displaystyle \frac {1}{m} dZ^{[l]} \cdot A^{[l-1]T}

The dimensions of dZ^{[l]} are n^{[l]} x m, where n^{[l]} is the number of output neurons in layer l and m is the number of samples, right?

The dimensions of A^{[l-1]} are n^{[l-1]} x m.

So the result of that dot product will have dimensions n^{[l]} x n^{[l-1]}, which are the dimensions of W^{[l]}, right?
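In numpy terms, that dimension check looks like this (a minimal sketch with made-up sizes):

```python
import numpy as np

# Shape check for dW[l] = (1/m) * dZ[l] (dot) A[l-1].T
n_l, n_prev, m = 4, 3, 5
dZ = np.random.randn(n_l, m)          # dZ[l]   has shape (n[l], m)
A_prev = np.random.randn(n_prev, m)   # A[l-1]  has shape (n[l-1], m)
dW = (1 / m) * dZ @ A_prev.T          # dW[l]   has shape (n[l], n[l-1]) = W[l].shape
print(dW.shape)                       # (4, 3)
```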

Actually your formulation is arguably clearer and more sensible, since it’s a more “pure” application of the Chain Rule. As we discussed earlier, the factor of \frac {1}{m} literally comes in only at the very last step when you compute J as the average of the L values. Thank you for publishing your version of the formulas.

But either formulation is technically correct and Prof Ng is the boss here, so he gets to choose how he wants to write the formulas. Note that he has gone out of his way to design these courses not to require any knowledge of calculus. I don’t remember for sure, but it would not surprise me if he actually never utters the phrase “Chain Rule” in any of the lectures. :nerd_face:


Hello @cclau,

If you are interested, you can have a look at

where I have derived the backprop formulas using only J.