While going through the lecture video, I tried to cross-check the math used in the forward and back propagation calculations.

- I noticed that Prof. Andrew mentioned caching z, w, b from each layer but did not mention the variable a. Yet the formulas explicitly use a, not z.
- I was also trying to make sense of his statement that in back prop the first building block from the end takes da[l] as input and outputs da[l-1]. I cannot find any support for this statement in the math.

Please let me know if I am missing something. Attached are a few screenshots for your reference.

You’ll notice that the formulas do include both A and Z, but it is the A from the *previous* layer. When you get to the assignment and see how the caches are actually constructed, you will see that both values are cached.
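As a rough illustration of what "both values are cached" can look like, here is a minimal sketch of one forward step. It uses single-unit layers with scalar values to stay short, and the function name `layer_forward` and the dictionary keys are made up for this example; the assignment's actual cache layout may differ.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(a_prev, w, b):
    """One forward step for a single-unit layer (hypothetical helper)."""
    z = w * a_prev + b
    a = sigmoid(z)
    # Store what the backward step for this layer will need:
    # the activation from the *previous* layer, this layer's
    # parameters, and this layer's z.
    cache = {"a_prev": a_prev, "w": w, "b": b, "z": z}
    return a, cache

x = 0.5
a1, cache1 = layer_forward(x, w=2.0, b=-1.0)
a2, cache2 = layer_forward(a1, w=-3.0, b=0.5)
```

Note that the a stored in layer 2's cache is `a1`, the output of layer 1, which is why a appears in the formulas with the index l-1.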

For question 2), I think it’s probably just a misinterpretation of what he says. Notice that dA shows up in the pictures, but nowhere in the formulas. At the output layer you start with A^{[L]} and that gives you dZ^{[L]}, which is later used to compute dZ^{[L-1]} and hence dW^{[L-1]}. The whole process is just a huge serial application of the Chain Rule. Since all the derivatives are w.r.t. J, the output of the very last function in the chain, everything at a given layer depends on all the later layers. Remember that in Prof Ng’s simplified notation:

dW^{[l]} = \displaystyle \frac {\partial J}{\partial W^{[l]}}

All the gradients are partial derivatives of J w.r.t. the parameter in question.
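One way to convince yourself of this is a small gradient check. The sketch below is not the course's code: it assumes a two-layer network with one sigmoid unit per layer, scalar values, and the cross-entropy cost for a single example, and all function names are made up for illustration. It computes the gradients by the chain rule and compares them with finite-difference estimates of the partial derivatives of J.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(params, x, y):
    """Cross-entropy cost J for one example (assumed setup)."""
    w1, b1, w2, b2 = params
    a1 = sigmoid(w1 * x + b1)
    a2 = sigmoid(w2 * a1 + b2)
    return -(y * math.log(a2) + (1 - y) * math.log(1 - a2))

def backprop(params, x, y):
    w1, b1, w2, b2 = params
    # Forward pass, keeping the intermediate values.
    a1 = sigmoid(w1 * x + b1)
    a2 = sigmoid(w2 * a1 + b2)
    # Output layer: with sigmoid + cross-entropy, dZ2 = A2 - Y.
    dz2 = a2 - y
    dw2 = dz2 * a1          # uses the previous layer's activation
    db2 = dz2
    # Hidden layer: dZ1 is built from dZ2 via the chain rule.
    # The factor w2 * dz2 is what the lecture diagrams label dA1.
    dz1 = w2 * dz2 * a1 * (1 - a1)
    dw1 = dz1 * x
    db1 = dz1
    return [dw1, db1, dw2, db2]

def numeric_grad(params, x, y, eps=1e-6):
    """Central-difference estimate of dJ/dparam for each parameter."""
    grads = []
    for i in range(len(params)):
        plus, minus = list(params), list(params)
        plus[i] += eps
        minus[i] -= eps
        grads.append((cost(plus, x, y) - cost(minus, x, y)) / (2 * eps))
    return grads

params = [0.4, -0.2, 1.5, 0.1]   # arbitrary w1, b1, w2, b2
analytic = backprop(params, x=0.7, y=1.0)
numeric = numeric_grad(params, x=0.7, y=1.0)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
```

If the chain-rule gradients are right, `max_diff` should be tiny, since each entry of `analytic` is meant to be the partial derivative of J with respect to the corresponding parameter.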

Thank you for the detailed explanation