Hello,

At the backprop part of the diagram we first compute the derivative of the loss function (dL/dA). Then, at the first backprop box,

we calculate dZ[2] = dA[2] * g'(Z[2]) and send it to the next box.

At the next box we calculate dW[2] = dZ[2] * A[1]T, db[2] = np.sum(dZ[2]) and dA[1] = W[2]T * dZ[2], and so on. So my question is: why is the output of box 1 dL/dA[2]? Shouldn't it be dL/dZ[2]? Also, on the programming side we pass variables that way: we send dZ[2] to the next function.

I understand your concern! You’re right to question this, and it’s a great observation. The output of the first backprop box should indeed be dL/dZ[2], not dL/dA[2]. The derivative of the loss function with respect to the activation A[2] is not what we need to pass on to the next layer. Instead, we need to compute the derivative with respect to the weighted sum Z[2], which is the input to the activation function.

Think of it like this: we're trying to measure how much each parameter contributes to the final error. At each layer, we need the error gradient with respect to that layer's pre-activation input Z, not just its output A. In this case, dL/dZ[2] is what lets us compute the gradients for the second layer's parameters, dW[2] and db[2], and it is also what we propagate back (as dA[1]) so the first layer can do the same.
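In code, this is exactly why a single quantity gets passed between functions: the backward step computes dL/dZ[2] once, then reuses it for dW[2], db[2], and dA[1]. Here is a minimal NumPy sketch, assuming a sigmoid output activation; the shapes and function name are illustrative, not taken from the course code:

```python
import numpy as np

# Assumed shapes (illustrative): A1 is (n1, m), W2 is (1, n1),
# Z2 and dA2 are (1, m), where m is the number of training examples.
def backprop_layer2(dA2, Z2, A1, W2):
    m = A1.shape[1]
    s = 1 / (1 + np.exp(-Z2))           # g(Z[2]) for a sigmoid activation
    dZ2 = dA2 * s * (1 - s)             # dL/dZ[2] = dL/dA[2] * g'(Z[2])
    dW2 = (1 / m) * dZ2 @ A1.T          # gradient for this layer's weights
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)  # ... and biases
    dA1 = W2.T @ dZ2                    # what gets sent back to layer 1
    return dZ2, dW2, db2, dA1
```

Notice that dZ2 is computed first and everything else in the step is built from it, which matches what you observed on the programming side.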

I hope this clears up any confusion, and please let me know if you have further questions!

Best Regards,

Muhammad John Abbas

The output of the box you shared is correct. Regarding your question:

So my question is: why is the output of box 1 dL/dA[2]? Shouldn't it be dL/dZ[2]?

By the chain rule:

\frac{dL}{dZ^{[2]}} = \frac{dL}{dA^{[2]}} \times \frac{dA^{[2]}}{dZ^{[2]}}

So, instead of passing \frac{dL}{dA^{[2]}} and \frac{dA^{[2]}}{dZ^{[2]}} separately, we pass their product \frac{dL}{dZ^{[2]}} = \frac{dL}{dA^{[2]}} \times \frac{dA^{[2]}}{dZ^{[2]}}, which already accounts for the effect of both.
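You can verify this chain-rule product numerically with a finite difference. A small sketch, assuming a squared-error loss and a sigmoid activation (both chosen purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical scalar example: L(a) = (a - y)^2 with a = sigmoid(z2)
y = 1.0
z2 = 0.5
a2 = sigmoid(z2)

dL_dA2 = 2 * (a2 - y)        # dL/dA[2]
dA2_dZ2 = a2 * (1 - a2)      # dA[2]/dZ[2] = g'(Z[2]) for sigmoid
dL_dZ2 = dL_dA2 * dA2_dZ2    # the single quantity the box passes on

# Central-difference estimate of dL/dZ[2] agrees with the product
eps = 1e-6
L = lambda z: (sigmoid(z) - y) ** 2
numeric = (L(z2 + eps) - L(z2 - eps)) / (2 * eps)
```

The numerical derivative of L with respect to z2 matches dL_dZ2, confirming that passing the product alone loses no information.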