I am at the last optional lecture
Someone please explain: how did we get to this result? We know the expected output of only the final layer, so how do we propagate that error back into the hidden layer?
That is just an application of the Chain Rule between layer 1 and layer 2. Prof Ng has specifically designed this course not to require knowledge of calculus, so he just presents the formulas and does not show how to derive them. Here’s a thread with links to the derivations.
@paulinpaloalto thank you so much, I was stuck there. I hadn't tried deriving it with the chain rule. I thought you would somehow need to know the error in the preceding layers, but you only know the desired output of the final layer.
This is what I did, with the help of that thread.
The point is that the dZ^{[l]} formula is just between two adjacent layers, so it ends up being one factor in the overall calculation of the gradients we actually care about, which are dW^{[l]} and db^{[l]}. Those are gradients w.r.t. the final cost J, and thus involve multiplying together the chain rule factors at every layer.
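For concreteness, here is a minimal numpy sketch of that backward pass for a 2-layer network, assuming a tanh hidden layer and a sigmoid output with cross-entropy cost (so dZ^{[2]} = A^{[2]} - Y and g^{[1]'}(Z^{[1]}) = 1 - (A^{[1]})^2). The function name and argument names are my own, not course code:

```python
import numpy as np

def backward_two_layer(X, Y, A1, A2, W2):
    """Backward pass for a 2-layer net: tanh hidden layer, sigmoid output.

    dZ2 is the error at the output layer; dZ1 is obtained from it by the
    chain rule, and dW/db at each layer are the gradients w.r.t. the cost J.
    Columns of X, Y, A1, A2 are the m training examples.
    """
    m = Y.shape[1]

    dZ2 = A2 - Y                                       # sigmoid + cross-entropy
    dW2 = (1 / m) * np.dot(dZ2, A1.T)                  # gradient of J w.r.t. W2
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)

    # chain rule: push dZ2 back through W2, then through the tanh activation
    dZ1 = np.dot(W2.T, dZ2) * (1 - np.power(A1, 2))
    dW1 = (1 / m) * np.dot(dZ1, X.T)                   # gradient of J w.r.t. W1
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)

    return dW1, db1, dW2, db2
```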
Hey @paulinpaloalto, could you explain how dz^{[2]} w^{[2]} * g^{[1]'}(z^{[1]}) = dz^{[1]} gets us to dz^{[1]} = (w^{[2]})^T dz^{[2]} * g^{[1]'}(z^{[1]})? Why the transpose?
Did you solve this problem?
The simple answer is that the dimensions don't work if you don't include the transpose. If you want to understand why it comes out that way and learn more about the math behind it, please have a look through the links on the derivations thread, which was also given earlier in this discussion.
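As a quick dimension check, here is a toy numpy example with made-up sizes (not course code): with n^{[1]} = 4 hidden units, n^{[2]} = 1 output unit, and m = 5 examples, W^{[2]} is (1, 4) and dZ^{[2]} is (1, 5), so only (W^{[2]})^T dZ^{[2]} produces the (4, 5) shape that dZ^{[1]} must have to match Z^{[1]}:

```python
import numpy as np

n1, n2, m = 4, 1, 5                  # hidden units, output units, examples (made-up sizes)
W2 = np.random.randn(n2, n1)         # shape (1, 4)
dZ2 = np.random.randn(n2, m)         # shape (1, 5)
Z1 = np.random.randn(n1, m)          # shape (4, 5)

dZ1 = np.dot(W2.T, dZ2) * (1 - np.tanh(Z1) ** 2)   # (4,1)·(1,5) -> (4,5), matches Z1
print(dZ1.shape)                     # (4, 5)

# np.dot(W2, dZ2) would raise a ValueError: shapes (1,4) and (1,5) are not aligned
```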