Why is it dz^{[1]} = {W^{[2]}}^{T} dz^{[2]} \times {g^{[1]}}^{'}(z^{[1]}) and not A^{[1]} - Y, like it was done for dz^{[2]} = A^{[2]} - Y?
My understanding:
- We are not doing A^{[1]} - Y because layer 1 is not directly connected to the true labels; instead, the error is propagated backwards from layer 2 (the output layer) to layer 1.
- dz^{[1]} = {W^{[2]}}^{T} dz^{[2]} \times {g^{[1]}}^{'}(z^{[1]}) means \frac{dLoss}{dz^{[1]}} = {W^{[2]}}^{T} \frac{dLoss}{dz^{[2]}} \times \frac{dA^{[1]}}{dz^{[1]}}.
- \frac{dA^{[1]}}{dz^{[1]}} represents how much A^{[1]} changes w.r.t. z^{[1]}, and A^{[1]} in turn is the input to the output layer that produces the loss. So \frac{dA^{[1]}}{dz^{[1]}} kind of scales the output-layer error (via the product \frac{dLoss}{dz^{[2]}} \times \frac{dA^{[1]}}{dz^{[1]}}), and the whole thing is multiplied by {W^{[2]}}^{T} to pass the portion of the error contributed through layer 2 back to layer 1.
Please clarify and correct me.
Things work out differently at the output layer because of the way that the derivative of sigmoid and the derivative of the cross entropy loss function work nicely together. Here’s a thread which shows that. At the inner layers of the network, you don’t get that nice simplification.
But note that derivations involving calculus are beyond the scope of these courses. If you have some math background, here’s a thread with links to more information on the derivations of back propagation.
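If it helps, here is a quick sketch of that simplification for a single training example (so everything is a scalar), with a = \sigma(z^{[2]}) as the sigmoid output and the binary cross-entropy loss:
L(a, y) = -\big(y \log a + (1 - y) \log(1 - a)\big), \qquad \sigma'(z^{[2]}) = a(1 - a)
\frac{dL}{da} = -\frac{y}{a} + \frac{1 - y}{1 - a} = \frac{a - y}{a(1 - a)}
\frac{dL}{dz^{[2]}} = \frac{dL}{da} \cdot \frac{da}{dz^{[2]}} = \frac{a - y}{a(1 - a)} \cdot a(1 - a) = a - y
At a hidden layer there is no matching loss term to cancel the activation derivative, so you keep the explicit g^{[1]'}(z^{[1]}) factor.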
OK, now I understand the derivation behind \frac{dLoss}{dz^{[1]}} = {W^{[2]}}^{T} \frac{dLoss}{dz^{[2]}} \circ \frac{dA^{[1]}}{dz^{[1]}}.
\frac{d \text{Loss}}{d z^{[1]}} can be expressed using the chain rule as:
\frac{d \text{Loss}}{d z^{[1]}} = \frac{d \text{Loss}}{d z^{[2]}} \cdot \frac{d z^{[2]}}{d A^{[1]}} \cdot \frac{d A^{[1]}}{d z^{[1]}}
- \frac{d \text{Loss}}{d z^{[2]}}:
\frac{d \text{Loss}}{d z^{[2]}} = A^{[2]} - Y
- \frac{d z^{[2]}}{d A^{[1]}}:
Since z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}, the derivative of z^{[2]} w.r.t. A^{[1]} is just W^{[2]}:
\frac{d z^{[2]}}{d A^{[1]}} = W^{[2]}
- \frac{d A^{[1]}}{d z^{[1]}}:
\frac{d A^{[1]}}{d z^{[1]}} = \frac{d g^{[1]}(z^{[1]})}{d z^{[1]}} = g^{[1]'}(z^{[1]})
Putting these terms together, we get:
\frac{d \text{Loss}}{d z^{[1]}} = {W^{[2]}}^{T} \frac{d \text{Loss}}{d z^{[2]}} \circ g^{[1]'}(z^{[1]})
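As a sanity check, here is a small NumPy sketch of this formula (my own made-up sizes, with g^{[1]} = tanh and a sigmoid output; the loss is summed over the examples so no \frac{1}{m} factor appears), compared against a finite-difference gradient:

```python
import numpy as np

# Sketch only: made-up sizes and random data, not course code.
np.random.seed(1)
n1, n2, m = 4, 1, 3
Z1 = np.random.randn(n1, m)
W2 = np.random.randn(n2, n1)
b2 = np.random.randn(n2, 1)
Y  = np.array([[1.0, 0.0, 1.0]])

def loss_from_Z1(Z1):
    A1 = np.tanh(Z1)                               # g1 = tanh
    A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))         # g2 = sigmoid
    return -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

# Analytic gradient: dZ2 = A2 - Y, then dZ1 = W2^T dZ2, elementwise * g1'(Z1)
A1  = np.tanh(Z1)
A2  = 1 / (1 + np.exp(-(W2 @ A1 + b2)))
dZ2 = A2 - Y
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)                 # tanh'(z) = 1 - tanh(z)^2

# Finite-difference gradient, entry by entry
eps = 1e-6
dZ1_num = np.zeros_like(Z1)
for i in range(n1):
    for j in range(m):
        Zp, Zm = Z1.copy(), Z1.copy()
        Zp[i, j] += eps
        Zm[i, j] -= eps
        dZ1_num[i, j] = (loss_from_Z1(Zp) - loss_from_Z1(Zm)) / (2 * eps)

print(np.max(np.abs(dZ1 - dZ1_num)))               # tiny (about 1e-8): the two gradients agree
```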
Each term has the following shape:
- \frac{d \text{Loss}}{d z^{[2]}}: This is the gradient of the loss with respect to the pre-activation z^{[2]} at the output layer. If there are n^{[2]} nodes in layer 2 and m training examples, then:
\frac{d \text{Loss}}{d z^{[2]}} \text{ has shape } (n^{[2]}, m)
- W^{[2]}: The weight matrix for the connection from layer 1 to layer 2. If layer 1 has n^{[1]} nodes and layer 2 has n^{[2]} nodes, then:
W^{[2]} \text{ has shape } (n^{[2]}, n^{[1]})
- {W^{[2]}}^{T} \frac{d \text{Loss}}{d z^{[2]}}:
- When we multiply \frac{d \text{Loss}}{d z^{[2]}} by {W^{[2]}}^{T}, we perform a matrix multiplication between {W^{[2]}}^{T} (of shape (n^{[1]}, n^{[2]})) and \frac{d \text{Loss}}{d z^{[2]}} (of shape (n^{[2]}, m)).
- The result has shape (n^{[1]}, m), which matches the number of nodes in layer 1 and the number of training examples.
- g^{[1]'}(z^{[1]}): This is the derivative of the activation function in layer 1 with respect to z^{[1]}.
- Since z^{[1]} has shape (n^{[1]}, m), so does g^{[1]'}(z^{[1]}).
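Here is a quick NumPy check of those shapes (the sizes n^{[1]} = 4, n^{[2]} = 1, m = 5 below are just an illustration):

```python
import numpy as np

n1, n2, m = 4, 1, 5                  # illustrative sizes only
dZ2 = np.random.randn(n2, m)         # dLoss/dZ2        -> (n2, m)
W2  = np.random.randn(n2, n1)        # layer-2 weights  -> (n2, n1)
gp1 = np.random.randn(n1, m)         # g^{[1]'}(Z1)     -> (n1, m)

back = W2.T @ dZ2                    # (n1, n2) @ (n2, m) -> (n1, m)
dZ1  = back * gp1                    # elementwise        -> (n1, m)
print(back.shape, dZ1.shape)         # (4, 5) (4, 5)
```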
That’s a nice explanation. Yes, all of this is just the Chain Rule in action, with the added twist that the objects are vectors and matrices. Forward propagation is a huge function composition: a series of functions, each one feeding its output as the input to the next layer's function, with the loss (vector valued) and the cost (scalar average of the loss) as the last two layers. Then, when you want to compute the derivative of the cost, you apply the Chain Rule, peeling the onion one layer at a time from the outside in, as you showed.
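To make that concrete for the two-layer network above, written schematically (ignoring the transposes and elementwise products needed to make the matrix shapes work out):
J = \frac{1}{m} \sum_{i=1}^{m} L\big(A^{[2](i)}, Y^{(i)}\big), \quad A^{[2]} = g^{[2]}(z^{[2]}), \quad z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}, \quad A^{[1]} = g^{[1]}(z^{[1]}), \quad z^{[1]} = W^{[1]} X + b^{[1]}
\frac{dJ}{dW^{[1]}} = \frac{dJ}{dA^{[2]}} \cdot \frac{dA^{[2]}}{dz^{[2]}} \cdot \frac{dz^{[2]}}{dA^{[1]}} \cdot \frac{dA^{[1]}}{dz^{[1]}} \cdot \frac{dz^{[1]}}{dW^{[1]}}
Each factor corresponds to one layer of the composition, and back propagation evaluates them starting from the cost/loss end.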