In week 4, Professor Ng describes the building blocks for a single training iteration: the forward pass and the backward pass. I understand why we should calculate dW and db: those rates of change eventually affect the cost function (in the next iteration, through the update of W and b).
What I don’t understand, however, is why it’s necessary to calculate the derivative of Z. Also, why do we bother to calculate the derivative of A at all?
Hi, @Sua. The short answer is that taking the derivative of Z is a necessary step in computing the derivative of A through the chain rule of calculus. Recall that A is a composite function: A^{l} = g\left(Z^{l}\right), where Z^{l} = WA^{l-1} + b and g\left(\cdot\right) is the activation function. We can write this more generally (for any given layer) as A = g\left(Z(W, b)\right). Using the chain rule:
\frac{dA}{dW} = \frac{dg}{dZ} \frac{dZ}{dW} and \frac{dA}{db} = \frac{dg}{dZ} \frac{dZ}{db}.
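To make that concrete, here is a minimal NumPy sketch of the backward step for a single layer (the function name, g_prime, and the other variable names are mine, just for illustration; this is not the assignment's exact code). The point is that dZ is obtained from dA via the chain rule, and dW and db then fall out of dZ:

```python
import numpy as np

def linear_activation_backward(dA, Z, A_prev, W, g_prime):
    """Backward step for one layer, where A = g(Z) and Z = W @ A_prev + b.

    dA      : gradient of the cost w.r.t. this layer's activation A
    Z       : cached pre-activation from the forward pass
    A_prev  : cached activation of the previous layer
    W       : this layer's weight matrix
    g_prime : derivative of the activation function, applied elementwise
    """
    m = A_prev.shape[1]                        # number of examples

    dZ = dA * g_prime(Z)                       # chain rule: dJ/dZ = dJ/dA * g'(Z)
    dW = (dZ @ A_prev.T) / m                   # dJ/dW = dJ/dZ * dZ/dW
    db = np.sum(dZ, axis=1, keepdims=True) / m # dJ/db = dJ/dZ * dZ/db
    dA_prev = W.T @ dZ                         # gradient handed back to layer l-1

    return dA_prev, dW, db

# Example with a ReLU layer (2 units, 3 inputs, 4 examples)
rng = np.random.default_rng(0)
A_prev = rng.standard_normal((3, 4))
W, b = rng.standard_normal((2, 3)), np.zeros((2, 1))
Z = W @ A_prev + b
dA = rng.standard_normal((2, 4))               # pretend this came from the layer above
relu_prime = lambda z: (z > 0).astype(float)
dA_prev, dW, db = linear_activation_backward(dA, Z, A_prev, W, relu_prime)
```

Notice that the function also returns dA_prev, which becomes the incoming dA for layer l-1; that is how the gradient keeps flowing backwards through the network.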
Okay, so I get that part, but I think the fundamental question still stands: why do we bother taking the derivative of A?
Like, I try explaining it to myself but I can’t finish the sentence: “Knowing the rate of change of the Activation layer allows us to…”
It seems to me that if we have the weights and bias, and we know how those affect the loss, that’s all we need, right?
But it all goes through the activation function at every layer, right? We are composing functions and then using the Chain Rule to compute the gradients (derivatives). You need to take Ken’s point and apply it at the level of the cost. If we want the derivative of the cost J w.r.t. some parameter, we need the derivative of every function between that parameter and the cost, right?
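To finish your sentence: knowing the rate of change of each activation allows us to relay the gradient from the cost all the way back to the parameters of every earlier layer. For a weight in layer l, the chain looks roughly like this (writing L for the output layer and abusing the derivative notation a bit):

\frac{dJ}{dW^{l}} = \frac{dJ}{dA^{L}} \frac{dA^{L}}{dZ^{L}} \frac{dZ^{L}}{dA^{L-1}} \cdots \frac{dA^{l}}{dZ^{l}} \frac{dZ^{l}}{dW^{l}}.

Every factor of the form \frac{dA}{dZ} or \frac{dZ}{dA} belongs to one of the layers sitting between W^{l} and the cost, which is why we have to compute the derivative of A (and of Z) at every layer on the way back.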