Not sure if you guys have figured out how dz[1] is calculated, but here is the derivation in case it helps someone who comes here.
The goal is to find the derivative of the loss with respect to z1, which is dL/dz1. Using the chain rule, this can be written as dL/da2 * da2/dz2 * dz2/da1 * da1/dz1.
Remember that the product dL/da2 * da2/dz2 is the derivative of the loss with respect to z2, which is dL/dz2 = a2-y (this holds for a sigmoid output with the cross-entropy loss). You can refer to this wonderful post to see how this is derived if you are not sure.
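As a quick sanity check of dL/dz2 = a2-y, here is a minimal sketch that compares it with a finite-difference estimate. It assumes a sigmoid output unit and the binary cross-entropy loss L = -(y*log(a2) + (1-y)*log(1-a2)); the function names and sample values are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_from_z2(z2, y):
    # Assumed setup: sigmoid output with binary cross-entropy loss.
    a2 = sigmoid(z2)
    return -(y * math.log(a2) + (1 - y) * math.log(1 - a2))

def numeric_derivative(f, x, eps=1e-6):
    # Central finite difference approximation of f'(x).
    return (f(x + eps) - f(x - eps)) / (2 * eps)

z2, y = 0.3, 1.0  # arbitrary illustrative values
analytic_dz2 = sigmoid(z2) - y  # the a2 - y formula
numeric_dz2 = numeric_derivative(lambda z: loss_from_z2(z, y), z2)
print(analytic_dz2, numeric_dz2)
```

The two printed numbers should agree to several decimal places.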
Now our equation is dL/dz1 = (a2-y) * dz2/da1 * da1/dz1.
dz2/da1 = d/da1 (w2*a1 + b2), because z2 is computed as w2*a1 + b2. The derivative of w2*a1 + b2 with respect to a1 is w2.
da1/dz1 = d/dz1 sigmoid(z1), and the derivative of sigmoid(z1) is sigmoid(z1) * (1-sigmoid(z1)).
Finally, putting everything together, dL/da2 * da2/dz2 * dz2/da1 * da1/dz1 becomes (a2-y) * w2 * sigmoid(z1) * (1-sigmoid(z1)), which Prof. Andrew has given as w2 * (a2-y) * g prime (z1). Here (a2-y) is the derivative of the loss with respect to z2, so it is named dz2, and the final term sigmoid(z1) * (1-sigmoid(z1)) is denoted as g prime (z1).
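To double-check the final result, here is a minimal scalar sketch that compares w2 * dz2 * g prime (z1) with a finite-difference estimate of dL/dz1. It assumes one hidden unit, one output unit, sigmoid at both layers, and the binary cross-entropy loss; the function names and the numbers are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_from_z1(z1, w2, b2, y):
    # Forward pass starting at z1: a1 = sigmoid(z1), z2 = w2*a1 + b2, a2 = sigmoid(z2).
    a1 = sigmoid(z1)
    a2 = sigmoid(w2 * a1 + b2)
    # Assumed loss: binary cross-entropy.
    return -(y * math.log(a2) + (1 - y) * math.log(1 - a2))

def dz1_chain_rule(z1, w2, b2, y):
    a1 = sigmoid(z1)
    a2 = sigmoid(w2 * a1 + b2)
    dz2 = a2 - y              # dL/dz2
    g_prime = a1 * (1 - a1)   # sigmoid(z1) * (1 - sigmoid(z1))
    return w2 * dz2 * g_prime

z1, w2, b2, y = 0.5, -1.2, 0.1, 1.0  # arbitrary illustrative values
eps = 1e-6
numeric_dz1 = (loss_from_z1(z1 + eps, w2, b2, y)
               - loss_from_z1(z1 - eps, w2, b2, y)) / (2 * eps)
analytic_dz1 = dz1_chain_rule(z1, w2, b2, y)
print(analytic_dz1, numeric_dz1)
```

In the vectorized version from the course the same three factors appear, with w2 replaced by the transposed weight matrix and the last product taken element-wise.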
Hope this helps. I couldn’t use math notation here, so it is all in plain text.
P.S.: Please note that da1/dz1 changes depending on the activation function used. Here I have assumed the activation function at the hidden layer is sigmoid, but one of the assignments uses tanh, in which case that portion of dz1 changes.
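For the tanh case, the derivative of tanh(z1) is 1 - tanh(z1)^2, so only the g prime (z1) factor is swapped out. A minimal scalar sketch, with hypothetical helper names and made-up values:

```python
import math

def g_prime_sigmoid(z1):
    a1 = 1.0 / (1.0 + math.exp(-z1))
    return a1 * (1 - a1)

def g_prime_tanh(z1):
    a1 = math.tanh(z1)
    return 1 - a1 ** 2

def dz1(w2, dz2, z1, g_prime):
    # dz1 = w2 * dz2 * g'(z1); only g' depends on the hidden activation.
    return w2 * dz2 * g_prime(z1)

w2, dz2, z1 = -1.2, 0.25, 0.5  # arbitrary illustrative values
dz1_sigmoid = dz1(w2, dz2, z1, g_prime_sigmoid)
dz1_tanh = dz1(w2, dz2, z1, g_prime_tanh)
print(dz1_sigmoid, dz1_tanh)
```

The w2 and dz2 factors stay exactly the same in both cases.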