Back propagation: why do we start from dZ2, and why the transpose?

In the programming assignment, the solution starts with dZ2, which is given as A2 - Y. Should we not first calculate the derivative of A2?

Further, dW2 = dZ2.dot(A1.T). Why are we doing a transpose here? I did not find this explained in any of the recommended videos either :frowning:

Thanks in advance.

We did calculate the derivative of A^{[2]} and it is expressed as part of the dZ^{[2]}. This is all just the Chain Rule in action. Remember what Prof Ng’s shorthand notation means:

dZ2 = \displaystyle \frac {\partial L}{\partial Z^{[2]}}
dA2 = \displaystyle \frac {\partial L}{\partial A^{[2]}}

Here L is the loss, a vector with one value per sample, not the scalar cost J, which is the average of the L values across the samples.

Of course we have:

A^{[2]} = \sigma(Z^{[2]})

Then by the chain rule:

dZ^{[2]} = \displaystyle \frac {\partial L}{\partial Z^{[2]}} = \frac {\partial L}{\partial A^{[2]}}\frac {\partial A^{[2]}}{\partial Z^{[2]}}

Here’s a thread by Mubsi and Eddy showing how to get from that formula to the simplified result:

dZ^{[2]} = A^{[2]} - Y
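If you'd rather see that simplification checked numerically than derived, here is a minimal sketch for a single sample. It assumes the loss is binary cross-entropy with a sigmoid output, which is what makes the result collapse to A2 - Y; the function names and the test values z = 0.7, y = 1 are mine, not from the assignment.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(z, y):
    # Binary cross-entropy for one sample, written as a function of z
    # (assumed loss; a = sigmoid(z) plays the role of A2).
    a = sigmoid(z)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def numerical_dz(z, y, eps=1e-6):
    # Central finite difference approximation of dL/dz.
    return (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)

def chain_rule_dz(z, y):
    # dL/dz = (dL/da) * (da/dz), the chain rule from the formula above.
    a = sigmoid(z)
    dL_da = -y / a + (1 - y) / (1 - a)
    da_dz = a * (1 - a)  # derivative of sigmoid
    return dL_da * da_dz

z, y = 0.7, 1.0  # hypothetical values for illustration
print("finite difference:", numerical_dz(z, y))
print("chain rule:       ", chain_rule_dz(z, y))
print("a - y:            ", sigmoid(z) - y)
```

All three printed values should agree to several decimal places, which is the scalar version of dZ2 = A2 - Y.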

As to the question about the formula for dW^{[2]}: deriving it requires matrix calculus, and Prof Ng has specifically designed these courses not to require knowledge of calculus. So there’s good news and bad news: the good news is that you don’t need to know calculus, but the bad news is that you have to accept the formulas as he gives them to you. One sanity check you can do without calculus is on the shapes: dW^{[2]} must have the same shape as W^{[2]}, and the transpose of A^{[1]} is what makes the dot product come out with that shape. If you want to dig deeper, here’s a thread that links to the derivations and other information you need in order to understand this.
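To make the transpose a bit less mysterious, here is a small NumPy sketch of the shapes involved. The layer sizes and random values are hypothetical, and dZ2 is just a random stand-in for A2 - Y. It is not a derivation, only a check that dZ2.dot(A1.T) has the same shape as W2 and equals the sum over samples of each sample's outer product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: n1 hidden units, n2 output units, m samples.
n1, n2, m = 4, 1, 5
A1 = rng.standard_normal((n1, m))   # one column per sample
W2 = rng.standard_normal((n2, n1))
dZ2 = rng.standard_normal((n2, m))  # stand-in for A2 - Y

# (n2, m) dot (m, n1) -> (n2, n1), the same shape as W2.
dW2 = dZ2.dot(A1.T)
print("W2 shape: ", W2.shape)
print("dW2 shape:", dW2.shape)

# Without the transpose the inner dimensions do not match.
try:
    dZ2.dot(A1)
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("no transpose:", err)

# The dot product with A1.T adds up one outer product per sample.
per_sample = sum(np.outer(dZ2[:, i], A1[:, i]) for i in range(m))
print("matches per-sample sum:", np.allclose(dW2, per_sample))
```

The last check is the intuition I find most useful: each sample contributes its own gradient for W2, and the matrix product with the transpose sums those contributions across the m columns in one step.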

Thanks for the detailed reply. I will try to digest this.