Can somebody explain to me the mathematical reasoning behind why dz[1] = W[2].T * dz[2] * g[1]'(z[1])? I understand how the derivatives chain for the final output layer, but I'm stumped by how dz[2] and dz[1] are chained during backprop as we transition from the final layer to the hidden layer. The instructor didn't seem to explain this part even in the optional "Backpropagation Intuition" lecture video, beyond stating the dimensions of these derivatives, without the mathematical intuition. I have attached a screenshot and circled the equation above in purple.
Oh, now that I think about it, I believe I understand now.
z[2] = W[2] * a[1] + b[2] (same form as the first layer, with a[1] in place of X)
Then dz[2]/da[1] would be W[2].
According to the chain rule:
dL/dz[1] = dL/dz[2] * dz[2]/da[1] * da[1]/dz[1] (and dL/dz[1] is what gets written as dz[1])
dL/dz[2] = dz[2]
dz[2]/da[1] = W[2]
da[1]/dz[1] = g[1]'(z[1])
dz[1] = W[2].T * dz[2] * g[1]'(z[1]) (transpose W[2] so the matrix dimensions match; the product with g[1]'(z[1]) is element-wise)
So the formula seems to make sense.
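To convince myself that the shapes line up, I also put together a tiny numpy sketch of just this one step. The layer sizes, the choice of tanh for g[1], and the variable names are my own assumptions for illustration, not the assignment code:

```python
import numpy as np

# Hypothetical layer sizes and batch size, just for illustration
n_x, n_1, n_2, m = 3, 4, 1, 5

rng = np.random.default_rng(0)
X  = rng.standard_normal((n_x, m))
W1 = rng.standard_normal((n_1, n_x))
b1 = np.zeros((n_1, 1))
W2 = rng.standard_normal((n_2, n_1))

# Forward pass for the hidden layer, assuming g[1] = tanh
Z1 = W1 @ X + b1        # shape (n_1, m)
A1 = np.tanh(Z1)        # shape (n_1, m)

# Pretend dZ2 = dL/dz[2] has already been computed for the output layer
dZ2 = rng.standard_normal((n_2, m))     # shape (n_2, m)

# The step in question: W[2].T times dz[2] (matrix product),
# then element-wise multiply by g[1]'(z[1]). For tanh, g'(z) = 1 - tanh(z)^2.
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)      # shape (n_1, m)

print(dZ1.shape)  # (4, 5) -- same shape as Z1, as required
```

The matrix product W2.T @ dZ2 is the dz[2]/da[1] = W[2] factor (it sums over the units of layer 2), and the element-wise factor (1 - A1 ** 2) is da[1]/dz[1] = g[1]'(z[1]).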
Yes, that all looks right. All of this is just the Chain Rule "writ large". The other thing that occasionally causes confusion is that Prof Ng's notation is ever so slightly ambiguous. When he writes dZ or dA or dW, it means slightly different things: you have to keep track of what the "numerator" is, i.e. what he is taking the partial derivative of. In all cases except the final gradients of W and b that we are actually going to apply, the gradients are of L, the vector loss. But dW^{[l]} and db^{[l]}, and only those, are partial derivatives of J, the scalar cost, which of course is the average of L across the samples.
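To spell that out in symbols (these are just the standard vectorized formulas from Course 1, with $dZ^{[l](i)}$ denoting the $i$-th column of $dZ^{[l]}$):

$$dZ^{[l]} = \frac{\partial L}{\partial Z^{[l]}}, \qquad dA^{[l]} = \frac{\partial L}{\partial A^{[l]}}$$

$$dW^{[l]} = \frac{\partial J}{\partial W^{[l]}} = \frac{1}{m}\, dZ^{[l]} A^{[l-1]T}, \qquad db^{[l]} = \frac{\partial J}{\partial b^{[l]}} = \frac{1}{m}\sum_{i=1}^{m} dZ^{[l](i)}$$

with $J = \frac{1}{m}\sum_{i=1}^{m} L^{(i)}$, which is exactly where the $\frac{1}{m}$ factor comes from when we switch from gradients of L to gradients of J.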
Of course the other high level point here is that these courses are very specifically designed not to require knowledge of even univariate calculus, let alone matrix calculus. So Prof Ng does not generally go into the derivations and only talks about intuitions based on the meaning of derivatives. The good news is that you don't need to know calculus, but the bad news is that you have to take his word for everything.
There are plenty of resources available on the web for people like you who have the math background to understand the derivations. Here's a thread with some links, and you can find plenty more with a little searching or by reading one of the textbooks, like Goodfellow et al., referenced on this bibliography thread.
Thank you for your answer. I'm just starting out in learning deep learning, and without much math background this course has been absolutely helpful. Sometimes, however, I do yearn for more cohesive explanations beyond just the pure applications. It makes the topic more interesting and the memory more long-lasting, in my opinion. I've looked at some of the resources you referred me to, and they seem to go much deeper in this respect. So, thank you very much for your advice and recommendations! I appreciate it.
I'm glad the links were useful, and thanks for the additional background on how you are approaching your DL learning journey. It turns out that DeepLearning.AI just published a brand new Specialization on Coursera called Mathematics for Machine Learning (M4ML). I haven't had a chance to take it yet, although I'm hoping to find time to do that. It would be another resource if you'd like to learn more of the related math. From what I can see just from looking at the syllabus, some of what they cover is prerequisite knowledge that you need for DLS (e.g. Linear Algebra, although they also cover some more advanced Linear Algebra topics that you don't need for DLS specifically), and it also covers some calculus and probability and statistics. Have a look and see if it catches your interest.