Backpropagation formulas

Matim · April 21, 2021, 8:01am

Hi!

I have a technical question on the gradient formulas used for backpropagation. Hopefully it’s quick to explain; I know it’s not really needed to complete the assignments.

So, in the backprop initialization (week 4, building NN step by step programming assignment) we initialize dA[L], where L is the output layer, as follows:

dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL)) # derivative of cost with respect to AL

There is a comment that we’re taking derivatives of the cost function, which I understand to be the cross-entropy function J. But if so, shouldn’t the expression above be further divided by m (i.e. the number of observations)? Just to be clear, I’m sure the formulas provided are internally consistent and correct, and the code works fine. Without spending too much time on derivations, though, it looks to me like the 1/m factor I’m missing simply shows up later in the dW and db gradients. Is it just a matter of convention, or am I missing something?

Colin · April 21, 2021, 8:17am

I have the same question

Btw, I suggest this great online book if you’re interested in the maths and don’t already know about it: Neural Networks and Deep Learning (direct link to the cross entropy chapter).

jonaslalin · April 21, 2021, 10:55am

Hello @Matim and @Colin,

Great question!

dAL is actually dL/dAL, where the loss function L is computed for a single training example, whereas the cost function J is the average of the losses over all m training examples.
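A minimal numerical sketch of this distinction, assuming the binary cross-entropy loss from the assignment. The helper `cost` and the random `AL` and `Y` values are made up for illustration and are not taken from the course code. It compares a finite-difference gradient of J with respect to each entry of AL against the assignment's `dAL` expression divided by m.

```python
import numpy as np

def cost(AL, Y):
    """Cross-entropy cost J: the mean of the per-example losses (hypothetical helper)."""
    losses = -(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))  # shape (1, m)
    return losses.mean()

rng = np.random.default_rng(0)
m = 5
AL = rng.uniform(0.1, 0.9, size=(1, m))            # made-up output activations
Y = rng.integers(0, 2, size=(1, m)).astype(float)  # made-up binary labels

# Expression from the assignment: dL/dAL for each example, with no 1/m factor.
dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

# Central finite differences of J with respect to each entry of AL.
eps = 1e-6
dJ_dAL = np.zeros_like(AL)
for i in range(m):
    bump = np.zeros_like(AL)
    bump[0, i] = eps
    dJ_dAL[0, i] = (cost(AL + bump, Y) - cost(AL - bump, Y)) / (2 * eps)

# The gradient of the cost J matches dAL / m, not dAL itself.
matches_scaled = np.allclose(dJ_dAL, dAL / m, atol=1e-6)
matches_unscaled = np.allclose(dJ_dAL, dAL, atol=1e-6)
print(matches_scaled, matches_unscaled)
```

In this sketch, each column of `dAL` is the derivative of one example's loss, and the missing 1/m is exactly the factor that separates it from the derivative of J.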

I advise you to check the video Forward and Backward Propagation in Week 4 again:

Around the 8 minute mark, Andrew will show you how dL/dAL fits into the bigger picture:

Colin · April 21, 2021, 11:13am

Ah, of course! Thanks a lot @jonaslalin

jonaslalin · April 21, 2021, 11:18am

@Matim and @Colin,

I will clarify the formulas even further. Please feel free to derive the backprop formulas yourself for an even better understanding.

Note the difference between the loss function L and the cost function J.

Colin · April 21, 2021, 11:42am

I think what I found misleading is that the video doesn’t seem to mention that, in the vectorized case, we start by differentiating J and not L (we want to minimize the cost for the whole minibatch), or I missed it.

What I would have done to explain this is just to write out the chain rule:

dJ / dW = (dJ / dAl) * (dAl/ dZ) * (dZ/dW)

(sorry for the lousy formatting, these are partial derivatives chained up).

And then to derive each term.
Note that I did not even mention the L step in this equation.

Then, the 1/m comes naturally from the first term.

For example, with an MSE loss (taking J = 1/(2m) * sum of (a - y)^2 over the examples), we have

dJ / dAl = (a - y) / m,

a gradient vector.

I was kind of expecting the first term with its 1/m, which is then applied to both dW and db according to the chain rule.
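The two bookkeeping conventions can be compared directly. The sketch below assumes a single sigmoid output layer with binary cross-entropy; the shapes, the random data, and all variable names other than `dAL` are illustrative and not from the assignment. One path keeps dL/dAL free of 1/m and applies the factor at dW and db, the other starts from dJ/dAL with 1/m already inside.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_prev, m = 3, 4
A_prev = rng.normal(size=(n_prev, m))              # made-up activations of the previous layer
Y = rng.integers(0, 2, size=(1, m)).astype(float)  # made-up binary labels
W = rng.normal(size=(1, n_prev))
b = np.zeros((1, 1))

Z = W @ A_prev + b
AL = sigmoid(Z)
g_prime = AL * (1 - AL)                            # derivative of the sigmoid at Z

# Convention 1: start from dL/dAL (no 1/m), apply 1/m when forming dW and db.
dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
dZ = dAL * g_prime
dW_loss_first = (1 / m) * dZ @ A_prev.T
db_loss_first = (1 / m) * np.sum(dZ, axis=1, keepdims=True)

# Convention 2: start from dJ/dAL, which already carries 1/m, then chain through.
dJ_dAL = dAL / m
dJ_dZ = dJ_dAL * g_prime
dW_cost_first = dJ_dZ @ A_prev.T
db_cost_first = np.sum(dJ_dZ, axis=1, keepdims=True)

print(np.allclose(dW_loss_first, dW_cost_first),
      np.allclose(db_loss_first, db_cost_first))
```

Since 1/m is a constant scalar, it can be pulled through every step of the chain, so where it is applied is purely a matter of convention.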

Cheers & thanks again for these clarifications.

Colin

Matim · April 21, 2021, 1:08pm

Thanks for the responses, this distinction between L and J is key; I think what confused me was thinking of J as a function of multiple variables rather than as a sum of individual losses.

To illustrate, I’ll go back to A[L], where L is the output layer. This is basically a 1 x m row vector. So when I see the expression dJ/dA[L], I interpret it as the standard gradient vector, i.e. the partial derivative of J with respect to each coordinate of A[L]. With J, this works fine for me, because J in this context can be viewed as a multivariate function of a[L](1), …, a[L](m). Thinking of dL/dA[L] in this way is difficult.

I think what’s really happening is we first work out all the gradients for some arbitrary individual example from the set, and then use the fact that J is a nice linear combination of individual losses, which makes it fairly easy to generalize.
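That two-step view (per-example gradients first, then averaging) can be sketched as follows, again assuming a single sigmoid output layer with binary cross-entropy. The data and variable names are made up for illustration. The loop handles one example at a time, and the result is compared with the usual vectorized expression.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n_prev, m = 3, 6
A_prev = rng.normal(size=(n_prev, m))              # made-up inputs to the output layer
Y = rng.integers(0, 2, size=(1, m)).astype(float)  # made-up binary labels
W = rng.normal(size=(1, n_prev))
b = 0.0

# Step 1: gradient of each individual loss with respect to W.
per_example_dW = []
for i in range(m):
    x = A_prev[:, [i]]                    # column i, shape (n_prev, 1)
    y = Y[:, [i]]                         # shape (1, 1)
    a = sigmoid(W @ x + b)                # shape (1, 1)
    da = -(y / a - (1 - y) / (1 - a))     # dL/da for this example only
    dz = da * a * (1 - a)                 # simplifies to a - y
    per_example_dW.append(dz @ x.T)       # shape (1, n_prev)

# Step 2: J is the mean of the losses, so its gradient is the mean of the gradients.
dW_averaged = np.mean(per_example_dW, axis=0)

# Vectorized form over all m examples at once.
AL = sigmoid(W @ A_prev + b)
dW_vectorized = (1 / m) * (AL - Y) @ A_prev.T

print(np.allclose(dW_averaged, dW_vectorized))
```

The matrix product in the vectorized line performs the sum over examples, which is why the only trace of the averaging left in the formula is the explicit 1/m.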

paulinpaloalto · April 21, 2021, 7:58pm

Good points! The easy way to see how the derivative of J works out is to remember that J is an average. The derivative of the average is the average of the derivatives. Think about it for a second and that should make sense. Taking derivatives is a “linear” operation and taking the average is also linear.
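Written out for the weights W, the linearity argument looks like this. The script letter for the per-example loss is an assumed notation, chosen here only to avoid a clash with the layer index L.

```latex
J(W, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(a^{(i)}, y^{(i)}\right)
\quad\Longrightarrow\quad
\frac{\partial J}{\partial W}
  = \frac{\partial}{\partial W}\left[\frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(a^{(i)}, y^{(i)}\right)\right]
  = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial \mathcal{L}\left(a^{(i)}, y^{(i)}\right)}{\partial W}
```

The same step applies to b, which is where the 1/m in the dW and db formulas comes from.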
