I wonder why we suddenly bother to introduce dA in Week 4? I understand that it is part of the calculus (dA^{[l]} = W^{[l+1]T} dZ^{[l+1]} and dZ^{[l]} = dA^{[l]} * g’(Z^{[l]})), but why not stick to the known quantities W and dZ? We use dA^{[l]} to derive dZ^{[l]}, but to initialize backprop we can use dZ^{[L]} = A^{[L]} - Y, so we don’t really have to calculate dA explicitly at all. Why introduce another quantity now?

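Because Week 4 is the point at which we finally reach the fully general case. You can take some shortcuts in the 1- or 2-layer cases, but those no longer work so well in the general case. The shortcut dZ^{[L]} = A^{[L]} - Y only holds at the output layer, where it comes from combining the sigmoid activation with the cross-entropy loss. For a hidden layer, dZ^{[l]} = dA^{[l]} * g’(Z^{[l]}) with whatever activation that layer uses, so you need to compute dA^{[l]} at each layer, not just the output layer.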
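To make that concrete, here is a minimal sketch of a backward pass in which dA is the only thing handed from one layer to the next. It is not the course's assignment code: the names `forward`, `layer_backward`, and `backward` are hypothetical, it uses plain Python lists instead of NumPy, and it assumes a single training example with a sigmoid output and cross-entropy loss.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Activation name -> (g, g'); both act on a single float.
ACTIVATIONS = {
    "relu": (lambda z: max(0.0, z), lambda z: 1.0 if z > 0 else 0.0),
    "tanh": (math.tanh, lambda z: 1.0 - math.tanh(z) ** 2),
    "sigmoid": (sigmoid, lambda z: sigmoid(z) * (1.0 - sigmoid(z))),
}


def forward(x, layers):
    """layers is a list of (W, b, activation_name); W is a list of rows."""
    a, caches = x, []
    for W, b, name in layers:
        z = [sum(w * a_j for w, a_j in zip(row, a)) + b_i
             for row, b_i in zip(W, b)]
        caches.append((a, W, z, name))  # everything the backward step needs
        g = ACTIVATIONS[name][0]
        a = [g(z_i) for z_i in z]
    return a, caches


def layer_backward(dA, cache):
    """Generic step: takes dA for this layer, returns dA for the previous one."""
    a_prev, W, z, name = cache
    g_prime = ACTIVATIONS[name][1]
    dZ = [da_i * g_prime(z_i) for da_i, z_i in zip(dA, z)]   # dZ = dA * g'(Z)
    dW = [[dz_i * a_j for a_j in a_prev] for dz_i in dZ]     # dW = dZ A_prev^T
    db = list(dZ)
    dA_prev = [sum(W[i][j] * dZ[i] for i in range(len(dZ)))  # dA_prev = W^T dZ
               for j in range(len(a_prev))]
    return dA_prev, dW, db


def backward(a_out, y, caches):
    # Only this first dA depends on the loss (binary cross-entropy here).
    dA = [-(y_i / a_i) + (1.0 - y_i) / (1.0 - a_i)
          for a_i, y_i in zip(a_out, y)]
    grads = []
    for cache in reversed(caches):
        dA, dW, db = layer_backward(dA, cache)
        grads.append((dW, db))
    return grads[::-1]


# Hypothetical 3-layer network: relu -> tanh -> sigmoid, made-up weights.
layers = [
    ([[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]], [0.2, 0.0, -0.1], "relu"),
    ([[0.3, -0.2, 0.5], [-0.4, 0.1, 0.2]], [0.0, 0.1], "tanh"),
    ([[0.7, -0.5]], [0.05], "sigmoid"),
]
x, y = [1.0, 2.0], [1.0]
a_out, caches = forward(x, layers)
grads = backward(a_out, y, caches)

# For the output layer, db equals dZ^{[L]}, which matches A^{[L]} - Y.
print(abs(grads[-1][1][0] - (a_out[0] - y[0])) < 1e-9)
```

Note that the loop in `backward` never needs to know which activation a layer uses, and the A^{[L]} - Y shortcut falls out of the generic step rather than being special-cased. Folding dA away, as in dZ^{[l]} = (W^{[l+1]T} dZ^{[l+1]}) * g’(Z^{[l]}), gives the same gradients, but each layer's step then has to reach into the next layer's weights.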

Of course, there is a certain amount of discretion here as well. You could probably formulate this in different ways, but Prof. Ng is teaching the class, and he has chosen the formulation that he thinks makes the most sense. When you’re teaching the class, you will get to choose the formulation.