Not sure if you guys have figured out how dz[1] is calculated but here is the calculation which might help someone who comes here.

So the goal is to find the derivative of the loss with respect to z1, which is `dL/dz1`. Using the chain rule, this can be written as `dL/da2 * da2/dz2 * dz2/da1 * da1/dz1`.

Remember that the term `dL/da2 * da2/dz2` is the derivative of the loss with respect to z2, which is `dL/dz2 = a2-y` (this holds for a sigmoid output unit with the cross-entropy loss). You can refer to this wonderful post to see how this is derived if you are not sure.
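If you would rather check the `a2-y` result numerically than follow the derivation, here is a minimal sketch. It assumes a sigmoid output unit and the binary cross-entropy loss; the function names and the sample values are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_from_z2(z2, y):
    # Binary cross-entropy on a sigmoid output (assumed setup).
    a2 = sigmoid(z2)
    return -(y * math.log(a2) + (1 - y) * math.log(1 - a2))

def numeric_dL_dz2(z2, y, eps=1e-6):
    # Central finite difference approximation of dL/dz2.
    return (loss_from_z2(z2 + eps, y) - loss_from_z2(z2 - eps, y)) / (2 * eps)

z2, y = 0.3, 1.0             # arbitrary sample values
analytic = sigmoid(z2) - y   # a2 - y
numeric = numeric_dL_dz2(z2, y)
print(analytic, numeric)     # the two numbers should agree closely
```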

Now our equation is `(a2-y) * dz2/da1 * da1/dz1`.

`dz2/da1 = d/da1 (w2*a1 + b)` because `z2` is computed as `w2*a1 + b`. The derivative of `w2*a1 + b` with respect to `a1` is `w2`.

`da1/dz1 = d/dz1 sigmoid(z1)`. The derivative of `sigmoid(z1)` is `sigmoid(z1) * (1-sigmoid(z1))`.
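A quick way to convince yourself of the sigmoid derivative is to compare it with a finite difference. This is only a sketch; `sigmoid_prime` and `numeric_derivative` are names I picked for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Analytic derivative: sigmoid(z) * (1 - sigmoid(z)).
    s = sigmoid(z)
    return s * (1.0 - s)

def numeric_derivative(f, x, eps=1e-6):
    # Central finite difference approximation of f'(x).
    return (f(x + eps) - f(x - eps)) / (2 * eps)

z1 = -0.7  # arbitrary sample value
print(sigmoid_prime(z1), numeric_derivative(sigmoid, z1))
```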

Finally, putting everything together, `dL/da2 * da2/dz2 * dz2/da1 * da1/dz1` becomes `(a2-y) * w2 * sigmoid(z1) * (1-sigmoid(z1))`, which Prof. Andrew has given as `w2 * (a2-y)` (the `a2-y` part is the derivative of the loss with respect to z2, so it is named dz2) times the final term `sigmoid(z1) * (1-sigmoid(z1))`, which is denoted as g prime (z1).
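To see the whole chain in one place, here is a small scalar sketch that compares `w2 * dz2 * g'(z1)` against a finite-difference estimate of `dL/dz1`. It assumes one hidden unit, one output unit, sigmoid at both layers, and the cross-entropy loss; the variable names and sample values are my own.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_loss(z1, w2, b2, y):
    # Forward pass starting from z1: a1 -> z2 -> a2 -> loss.
    a1 = sigmoid(z1)
    z2 = w2 * a1 + b2
    a2 = sigmoid(z2)
    return -(y * math.log(a2) + (1 - y) * math.log(1 - a2))

def analytic_dz1(z1, w2, b2, y):
    a1 = sigmoid(z1)
    a2 = sigmoid(w2 * a1 + b2)
    dz2 = a2 - y               # dL/dz2
    g_prime = a1 * (1 - a1)    # sigmoid'(z1)
    return w2 * dz2 * g_prime  # dL/dz1

z1, w2, b2, y = 0.5, -1.2, 0.1, 1.0  # arbitrary sample values
eps = 1e-6
numeric = (forward_loss(z1 + eps, w2, b2, y)
           - forward_loss(z1 - eps, w2, b2, y)) / (2 * eps)
analytic = analytic_dz1(z1, w2, b2, y)
print(analytic, numeric)  # the two numbers should agree closely
```

In the vectorized version from the lectures, `w2` becomes the transpose of the weight matrix and the multiplication by g prime (z1) is element-wise, but the scalar case shows the same chain rule.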

Hope this helps. I couldn't use math notation, so it is all in plain text.

P.S: Please note that `da1/dz1` can change depending on the activation function used. Here I have assumed the activation function at the hidden layer is sigmoid, but in one of the assignments tanh is used, so that portion of `dz1` changes.
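For the tanh case, the only piece that changes is g prime (z1), which becomes `1 - tanh(z1)^2`. A minimal sketch, with function names and sample values chosen just for illustration:

```python
import math

def tanh_prime(z):
    # Derivative of tanh(z): 1 - tanh(z)^2.
    return 1.0 - math.tanh(z) ** 2

def dz1_tanh(w2, dz2, z1):
    # Same chain rule as above, with g'(z1) swapped for the tanh version.
    return w2 * dz2 * tanh_prime(z1)

z1 = 0.4  # arbitrary sample value
eps = 1e-6
numeric = (math.tanh(z1 + eps) - math.tanh(z1 - eps)) / (2 * eps)
print(tanh_prime(z1), numeric)  # the two numbers should agree closely
print(dz1_tanh(w2=-1.2, dz2=-0.3, z1=z1))
```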