Course 1: Week 3 (backpropagation intuition)

Dear Mentors/Friends,

The backpropagation formula for dz[1] is
dz[1] = W[2]T dz[2] * g[1]’(z[1])
where g[1]’ is the derivative of the layer-1 activation function g[1], whichever activation is used.

dz[1] is defined as dL/dz[1] (the derivative of the loss L with respect to z[1]).

I am confused about how this equation is obtained: dz[1] = W[2]T dz[2] * g[1]’(z[1]).
As far as I know, there was no explanation of this in the videos either.

Could anyone explain this?

20 Likes

Prof Ng has specifically designed these courses so that they do not require the students to know any calculus (even univariate calculus, let alone matrix calculus), so he does not cover the derivations of many of the formulas that involve calculus. If you would like to dig deeper and have the math background, there are lots of resources available. Here’s a local thread with a bibliography, which includes textbooks that cover the actual math behind all this. One of the more math-oriented books is Goodfellow et al., which is listed there.

Here are some good websites that will also cover the derivations of back propagation:

Here’s a website from Cornell that covers the derivation.

Here’s a good introduction to the matrix calculus you need in order to follow the above.

The Matrix Cookbook from Univ of Waterloo is also a valuable resource for general Linear Algebra topics as well as matrix calculus.

Here are some notes from Stanford CS231n that give a good overview and insights on back propagation.

Here’s a bit deeper dive on the math also from Stanford CS231n.

Here are notes from EECS 442 at Univ of Michigan.

Mentor Jonas Slalin also covers all this and more on his website. That’s just the first page in his series.

23 Likes

Thanks a lot, Paul, for sharing the resources. Thank you again.

1 Like

The links above should cover everything, but there are plenty more good sources for this type of information out on the web. Googling around a bit should find plenty, although you also get a lot of Medium.com articles by people who don’t really know that much. Stanford has been particularly generous in making a lot of the supporting material for their graduate CS courses related to ML/DL publicly available. In addition to CS231n material listed above, here are a few more:

The lectures from CS231n as a YouTube channel.

CS224n top level site with lots of links.

CS230 syllabus.

Deep learning notes (“cheat sheet”) from CS229.

Overall CS229 website. This is the class that Prof Andrew taught for many years that gives the Stanford CS Grad Student version of the original Stanford Machine Learning course on Coursera. He’s handed it off to some other professors now, but it’s a great place to start.

9 Likes

Hello friend,
Did you figure out why dz[1] = W[2]T dz[2] * g[1]’ (z[1])? I was thinking about it too the entire time I was watching that lecture video. With the chain rule in mind, it seems that Prof. Ng got da[1] = W[2]T dz[2]. I’m not sure how he got that.

Thank you,
Nay

6 Likes

I am also wondering whether anyone has figured it out. I am the type of person who feels a lingering sense of malaise when I implement functions that I don’t completely understand, and some of the particular derivations baffle me.

I have taken multivariable calculus and linear algebra but am still struggling to figure it out, even with the Cornell resource. If one of the mentors could go through the process of outlining it with Prof. Ng’s notation/framework, that would be incredibly helpful! (Sort of like the dL/dZ logistic regression derivation—which is the most viewed post in the thread, for what it’s worth :grinning:)

10 Likes

It would be great if anyone can share how to derive dZ1 and dA1. Thanks a lot!

2 Likes

Not sure if you guys have figured out how dz[1] is calculated but here is the calculation which might help someone who comes here.

The goal is to compute the derivative of the loss with respect to z1, which is dL/dz1. By the chain rule this can be written as dL/da2 * da2/dz2 * dz2/da1 * da1/dz1.

Remember that the product dL/da2 * da2/dz2 is the derivative of the loss with respect to z2, which is dL/dz2 = a2 - y when the output layer is a sigmoid trained with the cross-entropy loss. You can refer to this wonderful post to see how this is derived if you are not sure.

Now our expression is (a2 - y) * dz2/da1 * da1/dz1.

dz2/da1 = d/da1 (w2*a1 + b2), because z2 = w2*a1 + b2.
The derivative of w2*a1 + b2 with respect to a1 is w2.

da1/dz1 = d/dz1 sigmoid(z1).
The derivative of sigmoid(z1) is sigmoid(z1) * (1 - sigmoid(z1)).

Finally everything put together,

dL/da2 * da2/dz2 * dz2/da1 * da1/dz1 becomes (a2 - y) * w2 * sigmoid(z1) * (1 - sigmoid(z1)). Prof. Andrew writes this as w2 * dz2 * g’(z1), where dz2 = a2 - y is the derivative of the loss with respect to z2, and g’(z1) denotes the final factor sigmoid(z1) * (1 - sigmoid(z1)).

Hope this helps, even though I could only use plain text rather than math notation.

P.S.: Please note that da1/dz1 changes depending on the activation function used. Here I have assumed the hidden-layer activation is sigmoid. One of the assignments uses tanh instead, in which case the g’(z1) factor of dz1 becomes 1 - tanh(z1)^2.
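
For anyone who wants to convince themselves numerically, here is a minimal sketch that compares the chain-rule result above with a finite-difference estimate of dL/dz1. It assumes the same scalar setup as the derivation (one hidden unit, sigmoid in both layers, cross-entropy loss). The function names and the specific numbers are made up for illustration and are not from the course.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(z1, w2, b2, y):
    """Cross-entropy loss of the tiny network, viewed as a function of z1."""
    a1 = sigmoid(z1)
    z2 = w2 * a1 + b2
    a2 = sigmoid(z2)
    return -(y * math.log(a2) + (1 - y) * math.log(1 - a2))

def dz1_chain_rule(z1, w2, b2, y):
    """dL/dz1 = (a2 - y) * w2 * sigmoid(z1) * (1 - sigmoid(z1))."""
    a1 = sigmoid(z1)
    a2 = sigmoid(w2 * a1 + b2)
    dz2 = a2 - y                     # dL/dz2 for sigmoid + cross-entropy
    return dz2 * w2 * a1 * (1 - a1)  # times dz2/da1 = w2, times g'(z1)

# Hypothetical values, chosen only for illustration.
z1, w2, b2, y = 0.3, -1.2, 0.5, 1.0

eps = 1e-6
numeric = (loss(z1 + eps, w2, b2, y) - loss(z1 - eps, w2, b2, y)) / (2 * eps)
analytic = dz1_chain_rule(z1, w2, b2, y)

print("chain rule:        ", analytic)
print("finite difference: ", numeric)
```

The two printed numbers should agree to many decimal places, which is the same idea as the gradient checking covered later in the specialization.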

20 Likes

Excellent write-up! Thank you so much!

1 Like

Please correct me if anything is wrong.

15 Likes

I got the same answer as well. Instead of A[2] - Y at the end of dz[1], I used dz[2] to simplify. You have beautiful handwriting.

2 Likes

I’m still puzzled by the transpose in the equations. Did anyone manage to understand this?

3 Likes

Dear friends,
Professor Andrew and the mentors are doing great work to make this specialization accessible to people who don’t know the math, but if you want the explanation:

I think the key to mastering backpropagation is to understand the shape rules for matrix equality and multiplication. If two matrices are equal, they must have the same shape. An elementwise product (*) needs two matrices of the same shape, or shapes that Python broadcasting can reconcile. A matrix product needs the inner dimensions to agree, as in (a,b) times (b,c). If what you have instead is (b,a) and (b,c), you have to transpose the first to get (a,b) before you can multiply, and that is where the transposes in the formulas come from.
Mr @paulinpaloalto can verify it.
Happy learning!
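
To make the shape argument concrete, here is a small NumPy sketch. The layer sizes, the random values, and the choice of tanh for the hidden layer are all assumptions made for the example. The shape conventions (W[2] is (n2, n1), and Z and dZ are (n, m)) are the ones used in the course.

```python
import numpy as np

# Hypothetical sizes: 4 hidden units, 3 output units, 5 training examples.
n1, n2, m = 4, 3, 5
rng = np.random.default_rng(0)

W2 = rng.standard_normal((n2, n1))   # weights of layer 2
dZ2 = rng.standard_normal((n2, m))   # stand-in for dL/dZ[2]
Z1 = rng.standard_normal((n1, m))    # stand-in for the layer-1 pre-activations

def g1_prime(Z):
    # Derivative of tanh, assuming tanh is the hidden activation.
    return 1.0 - np.tanh(Z) ** 2

dA1 = W2.T @ dZ2          # (n1, n2) @ (n2, m) -> (n1, m)
dZ1 = dA1 * g1_prime(Z1)  # elementwise product, both operands are (n1, m)

print(dA1.shape, dZ1.shape, Z1.shape)  # (4, 5) (4, 5) (4, 5)

# Without the transpose the inner dimensions do not agree: (3, 4) @ (3, 5).
shape_error = None
try:
    W2 @ dZ2
except ValueError as err:
    shape_error = str(err)
print("without transpose:", shape_error)
```

The shapes only show that the transpose is necessary for the product to be defined and for dZ1 to match Z1. They do not by themselves prove the formula; that still comes from the chain rule.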

4 Likes

I am not sure but I think this is a great question.

2 Likes

Thank You. I had a hard time understanding how they came up with this formula.

1 Like

Hello everyone,

This derivation confused me so much that I spent the whole day reading up on derivatives. This seems to answer my confusion, and I hope it will answer someone else’s as well some day.

I believe the issue stems from the assumption that z and dz (dL/dz) have the same dimensions, whereas the derivative of a scalar with respect to a column vector is a row vector.

If we take that to be true, then the dimensions match as is shown in the picture below :

Good luck and have fun
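
For readers who would rather see the transpose come out of the indices, here is a sketch of the chain rule written component by component for a single training example. The index names i and j are labels chosen here and are not from the lectures. The course convention that dz has the same shape as z is assumed.

```latex
% Forward step for layer 2, written per component:
z^{[2]}_i = \sum_j W^{[2]}_{ij}\, a^{[1]}_j + b^{[2]}_i

% a^{[1]}_j feeds into every z^{[2]}_i, so the chain rule sums over i:
\frac{\partial L}{\partial a^{[1]}_j}
  = \sum_i \frac{\partial L}{\partial z^{[2]}_i}\,
           \frac{\partial z^{[2]}_i}{\partial a^{[1]}_j}
  = \sum_i W^{[2]}_{ij}\, dz^{[2]}_i
  = \left( W^{[2]T} dz^{[2]} \right)_j

% a^{[1]}_j = g^{[1]}(z^{[1]}_j) depends only on z^{[1]}_j, so the last step is elementwise:
dz^{[1]}_j = \frac{\partial L}{\partial a^{[1]}_j}\, g^{[1]\prime}(z^{[1]}_j)
\quad\Longrightarrow\quad
dz^{[1]} = W^{[2]T} dz^{[2]} * g^{[1]\prime}(z^{[1]})
```

The sum runs over the first index of W[2], which is why the result is a product with W[2]T and not with W[2].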

4 Likes

One good source is this:

Chapter 6 explains backprop quite well and clearly.
I also found the attached images very helpful.


3 Likes

This is beautiful! Thanks a bunch

1 Like

Why are dw[2], db[2], dw[1] and db[1] obtained by dividing by m?

1 Like

Because those gradients are w.r.t. $J$, the cost. Remember that $J$ is the average of the loss values $L$ across all the samples. That is where the factor of $\frac{1}{m}$ comes from.

The notation Prof Ng uses is slightly ambiguous. For example the $dZ^{[l]}$ values there do not have the factor of $\frac{1}{m}$ because they are just “Chain Rule” factors at a given layer that are used to later compute $dW^{[l]}$ and $db^{[l]}$.

For example the first value there (which is special because it is for the output layer) is really:

$$dZ^{[2]} = \frac{\partial L}{\partial Z^{[2]}}$$

But then we have:

$$dW^{[2]} = \frac{\partial J}{\partial W^{[2]}}$$

Notice the difference of J versus L in the numerator there.
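
Here is a minimal NumPy sketch of that distinction, assuming a sigmoid output layer with cross-entropy loss. The sizes, the random data, and the helper names are invented for the example. It computes dZ[2] per example with no 1/m, applies the 1/m only when forming dW[2] and db[2], and then checks dW[2] against a finite-difference estimate of the gradient of J.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, m = 3, 6                                   # hypothetical sizes
A1 = rng.standard_normal((n1, m))              # stand-in hidden activations
Y = rng.integers(0, 2, size=(1, m)).astype(float)
W2 = rng.standard_normal((1, n1))
b2 = np.zeros((1, 1))

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def cost(W2, b2):
    """J = average over the m examples of the per-example loss L."""
    A2 = sigmoid(W2 @ A1 + b2)
    losses = -(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
    return losses.mean()

A2 = sigmoid(W2 @ A1 + b2)
dZ2 = A2 - Y                                   # dL/dZ2 per example: no 1/m here
dW2 = (dZ2 @ A1.T) / m                         # dJ/dW2: the 1/m appears here
db2 = dZ2.sum(axis=1, keepdims=True) / m       # dJ/db2: and here

# Finite-difference estimate of dJ/dW2, one weight at a time.
eps = 1e-6
dW2_numeric = np.zeros_like(W2)
for j in range(n1):
    step = np.zeros_like(W2)
    step[0, j] = eps
    dW2_numeric[0, j] = (cost(W2 + step, b2) - cost(W2 - step, b2)) / (2 * eps)

print(np.max(np.abs(dW2 - dW2_numeric)))       # should be very close to zero
```

Dropping the division by m in dW2 would make it m times too large compared with the numerical gradient of J, which is a quick way to see where the factor belongs.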

5 Likes