This variable, da[l-1], just pops out of nowhere in Week 4 of this course, and I am still not clear on why it is needed or how the formula for its calculation comes about.
da[l-1] = W[l].T @ dz[l]
Can someone shed some light on this please?
Hi @khteh ,
This is based on formula (10) in section 6.1, Linear Backward.
What’s the source of your reference? The lecture notes PDF doesn’t have formula (10) or a section 6.1 in it…
Hi @khteh ,
You should find that formula in both graded lab assignments for Week 4.
You should also find the handwritten formula under “Backward propagation for layer l” in the lecture notes.
That formula is the key to how back prop actually works, because it is the output from the calculation at layer l that feeds into and drives the computation at layer l - 1. There are three key outputs from back prop at layer l:
- dW^{[l]} and db^{[l]}, which we use to perform the actual parameter updates at layer l, which is the goal of back propagation.
- dA^{[l-1]}, which passes the gradients back to the previous layer, one step (layer) at a time. That is the key point at which the Chain Rule is applied, and it is what makes the whole process work. It’s where the actual “propagation” happens, right? But we’re going backward instead of forward. (See the sketch just below this list.)
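To make those three outputs concrete, here is a minimal NumPy sketch of the linear part of the backward step, along the lines of the Week 4 assignment. The function name linear_backward and the cache layout are illustrative assumptions:

```python
import numpy as np

def linear_backward(dZ, cache):
    # cache holds (A_prev, W, b) saved during forward prop at this layer.
    # Shapes: dZ (n_l, m), A_prev (n_prev, m), W (n_l, n_prev), b (n_l, 1).
    A_prev, W, b = cache
    m = A_prev.shape[1]  # number of training examples

    dW = (dZ @ A_prev.T) / m                    # drives the update of W^{[l]}
    db = np.sum(dZ, axis=1, keepdims=True) / m  # drives the update of b^{[l]}
    dA_prev = W.T @ dZ                          # dA^{[l-1]}: feeds back prop at layer l-1

    return dA_prev, dW, db
```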
If you want to know where it comes from, that is not really covered in the lectures, because Professor Ng has designed these courses not to require knowledge of matrix calculus. But it’s not that hard to see where it arises. The point is that we are using the Chain Rule to compute the derivatives of the final cost J w.r.t. each parameter at each layer. The key point in forward propagation where the output of the previous layer feeds into the input of the next layer is this:
Z^{[l]} = W^{[l]} \cdot A^{[l-1]} + b^{[l]}
A^{[l]} = g^{[l]}(Z^{[l]})
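As a quick concreteness check (my own illustration, with made-up layer sizes and a ReLU activation), the shapes at that junction look like this:

```python
import numpy as np

# Made-up sizes just for illustration: n_prev units in layer l-1,
# n_l units in layer l, and a batch of m examples.
n_prev, n_l, m = 4, 3, 5
A_prev = np.random.randn(n_prev, m)   # A^{[l-1]}, output of the previous layer
W = np.random.randn(n_l, n_prev)      # W^{[l]}
b = np.zeros((n_l, 1))                # b^{[l]}

Z = W @ A_prev + b                    # Z^{[l]} = W^{[l]} . A^{[l-1]} + b^{[l]}
A = np.maximum(0, Z)                  # A^{[l]} = g^{[l]}(Z^{[l]}), here g = ReLU
print(A.shape)                        # (3, 5), i.e. (n_l, m)
```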
If you take the derivatives of those two equations to get the Chain Rule factors at that layer, you end up with:
dA^{[l-1]} = W^{[l]T} \cdot dZ^{[l]}
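For the curious, here is a quick element-wise sketch of where the transpose comes from (my own summary, not from the lectures). Writing out the linear step per entry:

Z^{[l]}_{ik} = \sum_j W^{[l]}_{ij} A^{[l-1]}_{jk} + b^{[l]}_i \quad \Longrightarrow \quad \frac{\partial Z^{[l]}_{ik}}{\partial A^{[l-1]}_{jk}} = W^{[l]}_{ij}

so the Chain Rule gives

dA^{[l-1]}_{jk} = \sum_i \frac{\partial J}{\partial Z^{[l]}_{ik}} \frac{\partial Z^{[l]}_{ik}}{\partial A^{[l-1]}_{jk}} = \sum_i W^{[l]}_{ij} \, dZ^{[l]}_{ik} = \left( W^{[l]T} \cdot dZ^{[l]} \right)_{jk}

which is exactly the matrix formula above.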
As mentioned above, the full derivation is not covered in the course. Here is a thread with links both to background information about matrix calculus and the actual derivations of back propagation.