I have a technical question about the gradient formulas used for backpropagation. Hopefully it's quick to explain; I know it isn't really needed to complete the assignments.
So, in the backprop initialization (week 4, building NN step by step programming assignment) we initialize dA[L], where L is the output layer, as follows:
dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL)) # derivative of cost with respect to AL
There is a comment that we're taking derivatives of the cost function, which I understand to be the cross-entropy function J. But if so, shouldn't the expression above be further divided by m (i.e. the number of observations)? Just to be clear, I'm sure the formulas provided are internally consistent and correct, and the code works fine. Without spending too much time on derivations, though, it looks to me like the 1/m factor I'm missing simply shows up later in the dW and db gradients. Is it just a matter of convention, or am I missing something?
Btw I suggest this great online book if you’re interested in the maths and if you don’t already know about it : Neural networks and deep learning (direct link to the cross entropy chapter).
dAL is actually dL/dAL and the loss function L is calculated for a specific training example, whereas the cost function J is the average loss for all m training examples.
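To make this concrete, here is a minimal sketch (not the assignment's code) for a single sigmoid output layer. The sizes, the random data and the helper names `sigmoid` and `cost` are made up for illustration. It keeps dAL as the per-example dL/dAL with no 1/m, lets the 1/m enter at dW and db, and then checks dW against a finite-difference gradient of the averaged cost J.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W, b, X, Y):
    """Cross-entropy cost J: the average of the per-example losses."""
    AL = sigmoid(W @ X + b)
    return float(np.mean(-(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))))

rng = np.random.default_rng(0)
n_x, m = 3, 5                                  # hypothetical sizes
X = rng.normal(size=(n_x, m))
Y = rng.integers(0, 2, size=(1, m)).astype(float)
W = rng.normal(size=(1, n_x)) * 0.1
b = np.zeros((1, 1))

# Backward pass in the course's convention
AL = sigmoid(W @ X + b)
dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))  # per-example dL/dAL, no 1/m
dZ = dAL * AL * (1 - AL)                               # sigmoid'(Z) = AL * (1 - AL)
dW = (dZ @ X.T) / m                                    # the 1/m enters here
db = np.sum(dZ, axis=1, keepdims=True) / m

# Central finite differences on the averaged cost J
eps = 1e-6
dW_numeric = np.zeros_like(W)
for j in range(n_x):
    step = np.zeros_like(W)
    step[0, j] = eps
    dW_numeric[0, j] = (cost(W + step, b, X, Y) - cost(W - step, b, X, Y)) / (2 * eps)

print(np.allclose(dW, dW_numeric, atol=1e-6))
```

If the check passes, the analytic dW really is the gradient of J, even though dAL itself was never divided by m.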
I advise you to check the video Forward and Backward Propagation in Week 4 again:
Around the 8 minute mark, Andrew will show you how dL/dAL fits into the bigger picture:
I think what I found misleading is that the video does not seem to mention that, in the vectorized case, we start by differentiating J and not L (we want to minimize the cost for the whole minibatch), or maybe I missed that.
What I would have done to explain this is just to write the chain rule :
dJ / dW = (dJ / dAl) * (dAl/ dZ) * (dZ/dW)
(sorry for the lousy formatting, these are partial derivatives chained up).
And then to derive each term.
Note that I did not even mention the L step in this equation.
Then, the 1/m comes naturally from the first term.
For example, with MSE loss, J = (1/(2m)) * sum over i of (a(i) - y(i))^2, we have
dJ / dAl = (a - y) / m,
a gradient vector.
I was kind of expecting the first term with its 1/m, which is then applied to both dW and db according to the chain rule.
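The two bookkeeping conventions can be compared side by side. The sketch below is only an illustration under some assumptions: a linear output layer (so dA/dZ = 1), MSE cost J = (1/(2m)) * sum of (a - y)^2, and random data with made-up sizes. It puts the 1/m either in the first chain-rule term or at dW and db, and checks that both give the same gradients.

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, m = 4, 6                       # hypothetical sizes
X = rng.normal(size=(n_x, m))
Y = rng.normal(size=(1, m))
W = rng.normal(size=(1, n_x))
b = np.zeros((1, 1))

A = W @ X + b                       # linear output, so dA/dZ = 1

# Convention A: differentiate J directly, 1/m sits in the first term
dJ_dA = (A - Y) / m
dW_from_J = dJ_dA @ X.T
db_from_J = np.sum(dJ_dA, axis=1, keepdims=True)

# Convention B (the course's): per-example dL/dA, 1/m applied at dW and db
dL_dA = A - Y
dW_from_L = (dL_dA @ X.T) / m
db_from_L = np.sum(dL_dA, axis=1, keepdims=True) / m

print(np.allclose(dW_from_J, dW_from_L), np.allclose(db_from_J, db_from_L))
```

Because 1/m is a constant factor, it can be moved anywhere along the chain without changing dW or db; what matters is that it is applied exactly once.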
Thanks for the responses, this distinction between L and J is key; I think what confused me was thinking of J as a function of multiple variables rather than as a sum of individual losses.
To illustrate, I'll go back to A[L], where L is the output layer. This is basically a 1 x m row vector. So when I see the expression dJ/dA[L], I interpret it as the standard gradient vector, i.e. the partial derivative of J with respect to each coordinate of A[L]. With J, this works fine for me, because J in this context can be viewed as a multivariate function of a[L](1), …, a[L](m). Thinking of dL/dA[L] in this way is difficult.
I think what’s really happening is we first work out all the gradients for some arbitrary individual example from the set, and then use the fact that J is a nice linear combination of individual losses, which makes it fairly easy to generalize.
Good points! The easy way to see how the derivative of J works out is to remember that J is an average. The derivative of the average is the average of the derivatives. Think about it for a second and that should make sense. Taking derivatives is a “linear” operation and taking the average is also linear.
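Here is a small numerical illustration of that linearity argument, again only a sketch with made-up sizes and data for a single sigmoid layer. It computes dL/dW one example at a time, averages those gradients, and compares the result with the usual vectorized dW for the cost J.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n_x, m = 3, 8                                   # hypothetical sizes
X = rng.normal(size=(n_x, m))
Y = rng.integers(0, 2, size=(1, m)).astype(float)
W = rng.normal(size=(1, n_x)) * 0.1
b = np.zeros((1, 1))

# Gradient of each individual loss L_i with respect to W
per_example_grads = []
for i in range(m):
    x_i = X[:, i:i + 1]                         # column i, shape (n_x, 1)
    y_i = Y[:, i:i + 1]
    a_i = sigmoid(W @ x_i + b)
    da_i = -(y_i / a_i - (1 - y_i) / (1 - a_i)) # dL_i/da_i
    dz_i = da_i * a_i * (1 - a_i)               # chain through the sigmoid
    per_example_grads.append(dz_i @ x_i.T)      # dL_i/dW, shape (1, n_x)

avg_of_grads = np.mean(per_example_grads, axis=0)

# Vectorized gradient of the cost J (dZ simplifies to A - Y for sigmoid + cross-entropy)
A = sigmoid(W @ X + b)
dW_vectorized = ((A - Y) @ X.T) / m

print(np.allclose(avg_of_grads, dW_vectorized))
```

The per-example loop is what the vectorized formula does implicitly: the matrix product sums the m individual gradients and the division by m turns that sum into the average.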