Course 1: Week 3 (backpropagation intuition)

Dear Mentors/Friends,

Backpropagation for dz[1] is:

dz[1] = W[2]T dz[2] * g[1]'(z[1])

Here g[1] can be any activation function, and g[1]'(z[1]) is its derivative evaluated at z[1].

dz[1] is shorthand for dL/dz[1] (the derivative of the loss L with respect to z[1]).

I am confused about how this equation is obtained: dz[1] = W[2]T dz[2] * g[1]'(z[1]). As far as I know, there was no explanation of this in the videos either.

Could anyone explain this?


Prof Ng has specifically designed these courses so that they do not require the students to know any calculus (even univariate calculus, let alone matrix calculus), so he does not cover the derivations of a lot of the formulas which involve calculus. If you would like to dig deeper and have the math background, there are lots of resources available. Here’s a local thread with a bibliography, which includes textbooks that cover the actual math behind all this. One book that is more math oriented is Goodfellow et al., which is listed there.

Here are some good websites that will also cover the derivations of back propagation:

Here’s a website from Cornell that covers the derivation.

Here’s a good introduction to the matrix calculus you need in order to follow the above.

The Matrix Cookbook from Univ of Waterloo is also a valuable resource for general Linear Algebra topics as well as matrix calculus.

Here are some notes from Stanford CS231n that give a good overview and insights on back propagation.

Here’s a bit deeper dive on the math also from Stanford CS231n.

Here are notes from EECS 442 at Univ of Michigan.

Mentor Jonas Slalin also covers all this and more on his website. That’s just the first page in his series.


Thanks a lot, Paul, for sharing the resources. Thank you again.


The links above should cover everything, but there are plenty more good sources for this type of information out on the web. Googling around a bit should find plenty, although you also get a lot of articles by people who don’t really know that much. Stanford has been particularly generous in making a lot of the supporting material for their graduate CS courses related to ML/DL publicly available. In addition to CS231n material listed above, here are a few more:

The lectures from CS231n as a YouTube channel.

CS224n top level site with lots of links.

CS230 syllabus.

Deep learning notes (“cheat sheet”) from CS229.

Overall CS229 website. This is the class that Prof Andrew taught for many years that gives the Stanford CS Grad Student version of the original Stanford Machine Learning course on Coursera. He’s handed it off to some other professors now, but it’s a great place to start.


Hello friend,
Did you figure out why dz[1] = W[2]T dz[2] * g[1]'(z[1])? I was thinking about it too the entire time I was watching that lecture video. With the chain rule in mind, it seems that Prof. Ng got da[1] = W[2]T dz[2]. I'm not sure how he got that.

Thank you,


I am wondering too if anyone has figured out. I am the type of person who feels a lingering sense of malaise when I implement functions that I don’t completely understand—and some of the particular derivations baffle me.

I have taken multivariable calculus and linear algebra but am still struggling to figure it out, even with the Cornell resource. If one of the mentors could go through the process of outlining it with Prof. Ng’s notation/framework, that would be incredibly helpful! (Sort of like the dL/dZ logistic regression derivation—which is the most viewed post in the thread, for what it’s worth :grinning:)


It would be great if anyone could share how to derive dZ1 and dA1. Thanks a lot!


Not sure if you have figured out how dz[1] is calculated, but here is the calculation, which might help someone who comes here later.

The goal is to find the gradient of the loss with respect to z1, i.e. dL/dz1, and by the chain rule this can be written as dL/da2 * da2/dz2 * dz2/da1 * da1/dz1.

Remember that the term dL/da2 * da2/dz2 is the gradient of the loss with respect to z2, i.e. dL/dz2 = a2 - y. You can refer to this wonderful post to see how that is derived if you are not sure.

Now our expression is (a2 - y) * dz2/da1 * da1/dz1.

dz2/da1 = d/da1 (w2*a1 + b2), because z2 = w2*a1 + b2, and the derivative of w2*a1 + b2 with respect to a1 is w2.

da1/dz1 = d/dz1 sigmoid(z1), and the derivative of sigmoid(z1) is sigmoid(z1) * (1 - sigmoid(z1)).

Finally, putting everything together:

dL/da2 * da2/dz2 * dz2/da1 * da1/dz1 becomes (a2 - y) * w2 * sigmoid(z1) * (1 - sigmoid(z1)). Prof. Andrew writes this as w2 * dz2 * g'(z1), where dz2 = a2 - y (the gradient of the loss with respect to z2) and g'(z1) denotes sigmoid(z1) * (1 - sigmoid(z1)).

Hope this helps; I couldn't use math notation here, so it's just plain text.

P.S.: Please note that da1/dz1 changes depending on the activation function used. Here I have assumed the activation function at the hidden layer is sigmoid, but in one of the assignments tanh is used, so that portion of dz1 changes.
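To double-check the chain rule above, here is a minimal numerical sanity check in numpy. This is only a sketch: it assumes sigmoid in both layers and the cross-entropy cost, and all names and shapes are invented for illustration. It compares the analytic dZ1 = W2.T @ dZ2 * g'(Z1) against a finite-difference estimate of the gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n0, n1, m = 3, 4, 5                       # input dim, hidden dim, batch size (made up)
X = rng.standard_normal((n0, m))
Y = rng.integers(0, 2, (1, m)).astype(float)
W1 = rng.standard_normal((n1, n0)); b1 = np.zeros((n1, 1))
W2 = rng.standard_normal((1, n1));  b2 = np.zeros((1, 1))

def cost(W1_):
    # forward pass; cost J is the cross-entropy loss averaged over the m examples
    Z1 = W1_ @ X + b1; A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2; A2 = sigmoid(Z2)
    return -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

# analytic backprop, exactly the chain derived above
Z1 = W1 @ X + b1; A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2; A2 = sigmoid(Z2)
dZ2 = A2 - Y                              # dL/dZ2 for sigmoid + cross-entropy
dZ1 = (W2.T @ dZ2) * (A1 * (1 - A1))      # W2^T dZ2 * sigmoid'(Z1)
dW1 = (dZ1 @ X.T) / m                     # gradient of the averaged cost J

# finite-difference check on one entry of W1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
numeric = (cost(W1p) - cost(W1m)) / (2 * eps)
print(abs(numeric - dW1[0, 0]))           # tiny: just finite-difference error
```

Swapping sigmoid for tanh in the hidden layer only changes the `A1 * (1 - A1)` factor (it becomes `1 - A1**2`), which is exactly the point of the P.S. above.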


Excellent write-up! Thank you so much!


Please correct me if anything is wrong.


I got the same answer as well. Instead of A[2] - Y at the end of dz[1], I used dz[2] to simplify. You have beautiful handwriting.


I'm still puzzled by the transpose in the equations. Has anyone managed to understand this?


Dear friends,
Professor Andrew and the mentors are doing great work making this specialization accessible to people who don't understand the math, but if you want the explanation:

I think the key to mastering backpropagation is understanding the rules of matrix multiplication and equality. If two matrices are equal, they must have the same shape, and you can only multiply two matrices whose inner dimensions match: (a,b) * (b,c) works, but (a,b) * (a,c) does not. When the shapes don't line up like that, you transpose one of the matrices to force the inner dimensions to match, and that is where the transposes in the equations come from. (And of course Python broadcasting takes care of adding the bias terms.)
Mr @paulinpaloalto can verify it.
Happy learning!
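To make the shape argument concrete, here is a minimal numpy sketch (all dimensions are invented, and `gprime` just stands in for the elementwise g[1]'(z[1])):

```python
import numpy as np

# Invented dimensions: n1 hidden units, n2 output units, m examples.
n1, n2, m = 4, 1, 5
rng = np.random.default_rng(0)
W2  = rng.standard_normal((n2, n1))   # W[2] maps layer-1 activations to layer 2
dZ2 = rng.standard_normal((n2, m))    # dz[2] has the same shape as z[2]
Z1  = rng.standard_normal((n1, m))
gprime = np.ones_like(Z1)             # placeholder for g[1]'(z[1]), shape (n1, m)

# dz[1] must have z[1]'s shape, (n1, m).
# W2 @ dZ2 would be (n2,n1) @ (n2,m): inner dimensions n1 != n2, invalid.
# W2.T @ dZ2 is (n1,n2) @ (n2,m) -> (n1,m): exactly z[1]'s shape.
dZ1 = (W2.T @ dZ2) * gprime
print(dZ1.shape)                      # (4, 5)
```

So the transpose is not optional decoration: it is the only arrangement in which the matrix product is defined and produces something with the shape of z[1].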


I am not sure but I think this is a great question.


Thank You. I had a hard time understanding how they came up with this formula.


Hello everyone,

This derivation confused me so much that I spent the whole day reading up on derivatives. The following seems to resolve my confusion, and I hope it will help someone else some day as well.

I believe the issue stems from the assumption that z and dz (dL/dz) have the same dimensions, whereas the derivative of a scalar with respect to a column vector is a row vector.

If we take that to be true, then the dimensions match, as shown in the picture below:
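Concretely, a sketch using that row-vector convention (writing n_1 and n_2 for the layer sizes): since z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}, the Jacobian is

\displaystyle \frac {\partial z^{[2]}}{\partial a^{[1]}} = W^{[2]}

so the chain rule gives

\displaystyle \frac {\partial L}{\partial a^{[1]}} = \frac {\partial L}{\partial z^{[2]}} W^{[2]}

with shape (1, n_2) \times (n_2, n_1) = (1, n_1). Transposing back to the column shape the course uses for da^{[1]} then yields da^{[1]} = W^{[2]T} dz^{[2]}, which is exactly where the transpose comes from.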

Good luck and have fun


One good source is that:

Chapter 6 explains backprop quite well and clearly.
I also found the (attached) images very helpful.


This is beautiful! Thanks a bunch


Why are dw[2], db[2], dw[1], and db[1] obtained by dividing by m?


Because those gradients are w.r.t. J, the cost. Remember that J is the average of the L loss values across all the samples. That is where the factor of \frac {1}{m} comes from.

The notation Prof Ng uses is slightly ambiguous. For example the dZ^{[l]} values there do not have the factor of \frac {1}{m} because they are just “Chain Rule” factors at a given layer that are used to later compute dW^{[l]} and db^{[l]}.

For example the first value there (which is special because it is for the output layer) is really:

dZ^{[2]} = \displaystyle \frac {\partial L}{\partial Z^{[2]}}

But then we have:

dW^{[2]} = \displaystyle \frac {\partial J}{\partial W^{[2]}}

Notice the difference of J versus L in the numerator there.
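In code, the \frac {1}{m} shows up exactly once, at the step where we average over the batch. A numpy sketch (shapes invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5                                  # batch size (illustrative)
A1  = rng.standard_normal((4, m))      # layer-1 activations
dZ2 = rng.standard_normal((1, m))      # per-example chain-rule factor dL/dZ2, no 1/m

# Gradients of the cost J = (1/m) * sum of the per-example losses L:
dW2 = (dZ2 @ A1.T) / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m

# Equivalently, each entry of dW2 is an average over the m examples:
print(np.isclose(dW2[0, 0], np.mean(dZ2[0] * A1[0])))   # True
```

The `dZ2` array carries no 1/m because it is a per-example, per-layer Chain Rule factor with respect to L; only `dW2` and `db2`, which are derivatives of J, get divided by m.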