Why is it dz^{[1]} = {W^{[2]}}^{T} dz^{[2]} \times {g^{[1]}}^{'}(z^{[1]}) and not A^{[1]} - Y, like it was done for dz^{[2]} = A^{[2]} - Y?
My understanding:
- We are not doing A^{[1]} - Y because layer 1 is not directly connected to the true labels; instead, the error is propagated backwards from layer 2 (the output layer) to layer 1.
- dz^{[1]} = {W^{[2]}}^{T} dz^{[2]} \times {g^{[1]}}^{'}(z^{[1]}) means \frac{dLoss}{dz^{[1]}} = {W^{[2]}}^{T} \frac{dLoss}{dz^{[2]}} \times \frac{dA^{[1]}}{dz^{[1]}}.
- \frac{dA^{[1]}}{dz^{[1]}} represents how much A^{[1]} changes w.r.t. z^{[1]}, and A^{[1]} in turn is the input to the output layer that produces the loss. So \frac{dA^{[1]}}{dz^{[1]}} kind of scales the output-layer error (via the product \frac{dLoss}{dz^{[2]}} \times \frac{dA^{[1]}}{dz^{[1]}}), and the whole thing is multiplied by {W^{[2]}}^{T} to pass the portion of the error contributed through layer 2 back to layer 1.
Please clarify and correct me.
Things work out differently at the output layer because of the way that the derivative of sigmoid and the derivative of the cross entropy loss function work nicely together. Here’s a thread which shows that. At the inner layers of the network, you don’t get that nice simplification.
But note that derivations involving calculus are beyond the scope of these courses. If you have some math background, here’s a thread with links to more information on the derivations of back propagation.
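If it helps, here is a quick sketch of that simplification for a single training example (so everything is a scalar), with a = \sigma(z^{[2]}) as the sigmoid output and the binary cross-entropy loss:
L(a, y) = -\big(y \log a + (1 - y) \log(1 - a)\big), \qquad \sigma'(z^{[2]}) = a(1 - a)
\frac{dL}{da} = -\frac{y}{a} + \frac{1 - y}{1 - a} = \frac{a - y}{a(1 - a)}
\frac{dL}{dz^{[2]}} = \frac{dL}{da} \cdot \frac{da}{dz^{[2]}} = \frac{a - y}{a(1 - a)} \cdot a(1 - a) = a - y
At a hidden layer there is no matching loss term to cancel the activation derivative, so you keep the explicit g^{[1]'}(z^{[1]}) factor.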
OK, now I understand the derivation behind \frac{dLoss}{dz^{[1]}} = {W^{[2]}}^{T} \frac{dLoss}{dz^{[2]}} \circ \frac{dA^{[1]}}{dz^{[1]}}.
\frac{d \text{Loss}}{d z^{[1]}} can be expressed using the chain rule as:
\frac{d \text{Loss}}{d z^{[1]}} = \frac{d \text{Loss}}{d z^{[2]}} \cdot \frac{d z^{[2]}}{d A^{[1]}} \cdot \frac{d A^{[1]}}{d z^{[1]}}
- \frac{d \text{Loss}}{d z^{[2]}}:
\frac{d \text{Loss}}{d z^{[2]}} = A^{[2]} - Y
- \frac{d z^{[2]}}{d A^{[1]}}:
Since z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}, the derivative of z^{[2]} w.r.t. A^{[1]} is just W^{[2]}:
\frac{d z^{[2]}}{d A^{[1]}} = W^{[2]}
- \frac{d A^{[1]}}{d z^{[1]}}:
\frac{d A^{[1]}}{d z^{[1]}} = \frac{d g^{[1]}(z^{[1]})}{d z^{[1]}} = g^{[1]'}(z^{[1]})
Putting these terms together, we get:
\frac{d \text{Loss}}{d z^{[1]}} = {W^{[2]}}^{T} \frac{d \text{Loss}}{d z^{[2]}} \circ g^{[1]'}(z^{[1]})
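As a sanity check, here is a small NumPy sketch of this formula (my own made-up sizes, with g^{[1]} = tanh and a sigmoid output; the loss is summed over the examples so no \frac{1}{m} factor appears), compared against a finite-difference gradient:

```python
import numpy as np

# Sketch only: made-up sizes and random data, not course code.
np.random.seed(1)
n1, n2, m = 4, 1, 3
Z1 = np.random.randn(n1, m)
W2 = np.random.randn(n2, n1)
b2 = np.random.randn(n2, 1)
Y  = np.array([[1.0, 0.0, 1.0]])

def loss_from_Z1(Z1):
    A1 = np.tanh(Z1)                               # g1 = tanh
    A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))         # g2 = sigmoid
    return -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

# Analytic gradient: dZ2 = A2 - Y, then dZ1 = W2^T dZ2, elementwise * g1'(Z1)
A1  = np.tanh(Z1)
A2  = 1 / (1 + np.exp(-(W2 @ A1 + b2)))
dZ2 = A2 - Y
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)                 # tanh'(z) = 1 - tanh(z)^2

# Finite-difference gradient, entry by entry
eps = 1e-6
dZ1_num = np.zeros_like(Z1)
for i in range(n1):
    for j in range(m):
        Zp, Zm = Z1.copy(), Z1.copy()
        Zp[i, j] += eps
        Zm[i, j] -= eps
        dZ1_num[i, j] = (loss_from_Z1(Zp) - loss_from_Z1(Zm)) / (2 * eps)

print(np.max(np.abs(dZ1 - dZ1_num)))               # tiny (about 1e-8): the two gradients agree
```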
Each term has the following shape:
- \frac{d \text{Loss}}{d z^{[2]}}: This is the gradient of the loss with respect to the pre-activation z^{[2]} at the output layer. If there are n^{[2]} nodes in layer 2 and m training examples, then:
\frac{d \text{Loss}}{d z^{[2]}} \text{ has shape } (n^{[2]}, m)
- W^{[2]}: The weight matrix for the connection from layer 1 to layer 2. If layer 1 has n^{[1]} nodes and layer 2 has n^{[2]} nodes, then:
W^{[2]} \text{ has shape } (n^{[2]}, n^{[1]})
- {W^{[2]}}^{T} \frac{d \text{Loss}}{d z^{[2]}}:
- When we multiply \frac{d \text{Loss}}{d z^{[2]}} by {W^{[2]}}^{T}, we perform a matrix multiplication between {W^{[2]}}^{T} (of shape (n^{[1]}, n^{[2]})) and \frac{d \text{Loss}}{d z^{[2]}} (of shape (n^{[2]}, m)).
- The result has shape (n^{[1]}, m), which matches the number of nodes in layer 1 and the number of training examples.
- g^{[1]'}(z^{[1]}): This is the derivative of the activation function in layer 1 with respect to z^{[1]}.
- Since z^{[1]} has shape (n^{[1]}, m), so does g^{[1]'}(z^{[1]}).
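Here is a quick NumPy check of those shapes (the sizes n^{[1]} = 4, n^{[2]} = 1, m = 5 below are just an illustration):

```python
import numpy as np

n1, n2, m = 4, 1, 5                  # illustrative sizes only
dZ2 = np.random.randn(n2, m)         # dLoss/dZ2        -> (n2, m)
W2  = np.random.randn(n2, n1)        # layer-2 weights  -> (n2, n1)
gp1 = np.random.randn(n1, m)         # g^{[1]'}(Z1)     -> (n1, m)

back = W2.T @ dZ2                    # (n1, n2) @ (n2, m) -> (n1, m)
dZ1  = back * gp1                    # elementwise        -> (n1, m)
print(back.shape, dZ1.shape)         # (4, 5) (4, 5)
```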
That’s a nice explanation. Yes, all of this is just the Chain Rule in action, with the added twist that the objects are vectors and matrices. Forward propagation is a huge function composition: a series of functions, each one feeding its output as the input to the next layer's function, with the loss (vector valued) and the cost (scalar average of the loss) as the last two layers. Then, when you want to compute the derivative of the cost, you apply the Chain Rule, peeling the onion one layer at a time from the outside in, as you showed.
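To make that concrete for the two-layer network above, written schematically (ignoring the transposes and elementwise products needed to make the matrix shapes work out):
J = \frac{1}{m} \sum_{i=1}^{m} L\big(A^{[2](i)}, Y^{(i)}\big), \quad A^{[2]} = g^{[2]}(z^{[2]}), \quad z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}, \quad A^{[1]} = g^{[1]}(z^{[1]}), \quad z^{[1]} = W^{[1]} X + b^{[1]}
\frac{dJ}{dW^{[1]}} = \frac{dJ}{dA^{[2]}} \cdot \frac{dA^{[2]}}{dz^{[2]}} \cdot \frac{dz^{[2]}}{dA^{[1]}} \cdot \frac{dA^{[1]}}{dz^{[1]}} \cdot \frac{dz^{[1]}}{dW^{[1]}}
Each factor corresponds to one layer of the composition, and back propagation evaluates them starting from the cost/loss end.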