Course 1: Week 3 (backpropagation intuition)

Not sure if you guys have figured out how dz[1] is calculated, but here is the derivation in case it helps someone who comes here.

So the goal is to find the derivative of the loss with respect to z1, which is dL/dz1. Using the chain rule, this can be written as dL/da2 * da2/dz2 * dz2/da1 * da1/dz1.

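Remember that the term dL/da2 * da2/dz2 is the derivative of the loss with respect to z2, which is dL/dz2 = a2-y (this holds for a sigmoid output with the cross-entropy loss used in the course). You can refer to this wonderful post to see how this is derived if you are not sure.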

Now our equation is (a2-y) * dz2/da1 * da1/dz1

dz2/da1 = d/da1 (w2*a1 + b2) because z2 is computed as w2*a1 + b2
The derivative of w2*a1 + b2 with respect to a1 is w2

da1/dz1 = d/dz1 sigmoid(z1)
The derivative of sigmoid(z1) with respect to z1 is sigmoid(z1) * (1-sigmoid(z1)), which is the same as a1 * (1-a1)

Finally, putting everything together,

dL/da2 * da2/dz2 * dz2/da1 * da1/dz1 becomes (a2-y) * w2 * sigmoid(z1) * (1-sigmoid(z1)). Prof. Andrew Ng writes this as w2 * dz2 * g prime(z1) (in the vectorized form from the lecture, dz[1] = W[2]^T dz[2] * g[1]'(z[1])). Here dz2 = a2-y is the derivative of the loss with respect to z2, hence the name dz2, and g prime(z1) denotes the final term sigmoid(z1) * (1-sigmoid(z1)).
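
If you want to convince yourself that the final formula is right, here is a minimal sketch in Python that compares it with a numerical derivative. It assumes a scalar toy network (one hidden unit, one output unit), sigmoid at both layers and the cross-entropy loss. The function names, the bias b2 and the sample values are my own, not from the course.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_from_z1(z1, w2, b2, y):
    # Forward pass from z1 onward: a1 -> z2 -> a2 -> cross-entropy loss.
    a1 = sigmoid(z1)
    z2 = w2 * a1 + b2
    a2 = sigmoid(z2)
    return -(y * math.log(a2) + (1.0 - y) * math.log(1.0 - a2))

def dz1_analytic(z1, w2, b2, y):
    # Chain rule result: (a2 - y) * w2 * sigmoid(z1) * (1 - sigmoid(z1)).
    a1 = sigmoid(z1)
    a2 = sigmoid(w2 * a1 + b2)
    dz2 = a2 - y
    return dz2 * w2 * a1 * (1.0 - a1)

def dz1_numeric(z1, w2, b2, y, eps=1e-6):
    # Central finite difference of the loss with respect to z1.
    plus = loss_from_z1(z1 + eps, w2, b2, y)
    minus = loss_from_z1(z1 - eps, w2, b2, y)
    return (plus - minus) / (2.0 * eps)

# Arbitrary sample values, chosen only for illustration.
z1, w2, b2, y = 0.3, -1.2, 0.5, 1.0
print("analytic:", dz1_analytic(z1, w2, b2, y))
print("numeric: ", dz1_numeric(z1, w2, b2, y))
```

The two printed numbers should agree to several decimal places. This is the same idea as gradient checking, just on a single number instead of whole matrices.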

Hope this helps. Sorry that I couldn’t use math notation, only plain text.

P.S.: Please note that da1/dz1 changes depending on the activation function used. Here I have assumed the activation function at the hidden layer is sigmoid, but one of the assignments uses tanh. In that case the g prime(z1) part of dz1 becomes 1 - tanh(z1)^2, which is 1 - a1^2, while the rest of the expression stays the same.
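
To make the P.S. concrete, here is a rough sketch where only the g prime(z1) factor is swapped between sigmoid and tanh. It again assumes a scalar toy network with a sigmoid output and cross-entropy loss. The helper hidden_activation and the other names are hypothetical, made up for this post.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_activation(z1, activation):
    # Returns (a1, g_prime(z1)) for the chosen hidden-layer activation.
    if activation == "sigmoid":
        a1 = sigmoid(z1)
        return a1, a1 * (1.0 - a1)
    if activation == "tanh":
        a1 = math.tanh(z1)
        return a1, 1.0 - a1 ** 2
    raise ValueError("unsupported activation: " + activation)

def loss(z1, w2, b2, y, activation):
    # Forward pass from z1 onward with a sigmoid output unit.
    a1, _ = hidden_activation(z1, activation)
    a2 = sigmoid(w2 * a1 + b2)
    return -(y * math.log(a2) + (1.0 - y) * math.log(1.0 - a2))

def dz1(z1, w2, b2, y, activation):
    # dz1 = w2 * dz2 * g_prime(z1); only g_prime depends on the activation.
    a1, g_prime = hidden_activation(z1, activation)
    a2 = sigmoid(w2 * a1 + b2)
    dz2 = a2 - y
    return w2 * dz2 * g_prime

# Arbitrary sample values, chosen only for illustration.
for act in ("sigmoid", "tanh"):
    print(act, dz1(0.3, -1.2, 0.5, 1.0, act))
```

Notice that dz2 = a2-y and the w2 factor are computed the same way in both cases. Only the last factor differs.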