Back propagation: why do we start from dZ2, and why the transpose?

We did calculate the derivative of A^{[2]} and it is expressed as part of the dZ^{[2]}. This is all just the Chain Rule in action. Remember what Prof Ng’s shorthand notation means:

dZ2 = \displaystyle \frac {\partial L}{\partial Z^{[2]}}
dA2 = \displaystyle \frac {\partial L}{\partial A^{[2]}}

Here L is the vector-valued loss function (one value per sample), not the scalar cost J, which is the average of the L values across the samples.

Of course we have:

A^{[2]} = \sigma(Z^{[2]})

Then by the chain rule:

dZ^{[2]} = \displaystyle \frac {\partial L}{\partial Z^{[2]}} = \frac {\partial L}{\partial A^{[2]}}\frac {\partial A^{[2]}}{\partial Z^{[2]}}

Here’s a thread by Mubsi and Eddy showing how to get from that formula to the simplified result:

dZ^{[2]} = A^{[2]} - Y
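
If you'd like to convince yourself numerically, here is a minimal sketch. It assumes the output layer is a sigmoid and that L is the binary cross-entropy loss, so that \frac{\partial L}{\partial A^{[2]}} = -\frac{Y}{A^{[2]}} + \frac{1 - Y}{1 - A^{[2]}} and \frac{\partial A^{[2]}}{\partial Z^{[2]}} = A^{[2]}(1 - A^{[2]}). The shapes, the random data, and the helper names `sigmoid` and `loss` are made up for illustration and are not from the course notebooks.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(z, y):
    # Per-sample binary cross-entropy L (assumed loss, see lead-in)
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

rng = np.random.default_rng(0)
Z2 = rng.normal(size=(1, 5))                        # 1 output unit, 5 samples
Y = rng.integers(0, 2, size=(1, 5)).astype(float)   # made-up 0/1 labels

A2 = sigmoid(Z2)
dA2 = -Y / A2 + (1 - Y) / (1 - A2)     # dL/dA2
dZ2_chain = dA2 * A2 * (1 - A2)        # chain rule, elementwise product
dZ2_simple = A2 - Y                    # the simplified result

# Central finite difference of L with respect to each element of Z2
eps = 1e-6
dZ2_numeric = (loss(Z2 + eps, Y) - loss(Z2 - eps, Y)) / (2 * eps)

print(np.allclose(dZ2_chain, dZ2_simple))
print(np.allclose(dZ2_numeric, dZ2_simple, atol=1e-6))
```

Note that the product in the chain rule is elementwise here, because the sigmoid acts on each element of Z^{[2]} independently.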

As to the question about the formula for dW^{[2]}, note that deriving it requires matrix calculus, and Prof Ng has specifically designed these courses not to require knowledge of calculus. So there’s good news and bad news: the good news is you don’t need to know calculus, but the bad news is you just have to accept the formulas as he gives them to you. If you want to dig deeper, here’s a thread that links to the derivations and other information you need in order to understand this.
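
Short of doing the matrix calculus, one way to see why the transpose has to be there is to look at the shapes and run a finite-difference check. The sketch below assumes the formula as given in the course, dW^{[2]} = \frac{1}{m} dZ^{[2]} A^{[1]T}, with Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}, a sigmoid output and binary cross-entropy loss. The layer sizes, the random data, and the helper names `sigmoid` and `cost` are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W2, b2, A1, Y):
    # Scalar cost J: average of the per-sample loss L (single output unit)
    A2 = sigmoid(W2 @ A1 + b2)
    L = -(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
    return L.mean()

rng = np.random.default_rng(1)
n1, m = 4, 6                              # hypothetical: 4 hidden units, 6 samples
A1 = rng.normal(size=(n1, m))             # hidden activations, shape (n1, m)
W2 = rng.normal(size=(1, n1))             # output weights, shape (1, n1)
b2 = np.zeros((1, 1))
Y = rng.integers(0, 2, size=(1, m)).astype(float)

A2 = sigmoid(W2 @ A1 + b2)
dZ2 = A2 - Y                              # shape (1, m)
dW2 = (dZ2 @ A1.T) / m                    # (1, m) @ (m, n1) -> (1, n1)

# Finite-difference gradient of J with respect to each entry of W2
eps = 1e-6
dW2_numeric = np.zeros_like(W2)
for j in range(n1):
    step = np.zeros_like(W2)
    step[0, j] = eps
    dW2_numeric[0, j] = (cost(W2 + step, b2, A1, Y)
                         - cost(W2 - step, b2, A1, Y)) / (2 * eps)

print(dW2.shape == W2.shape)
print(np.allclose(dW2, dW2_numeric, atol=1e-6))
```

The gradient has to have the same shape as W^{[2]}, and dZ^{[2]} times A^{[1]} without the transpose would not even be a valid matrix product here. The product with the transpose also sums over the samples dimension, which is where the average across samples comes from once you divide by m.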
