Week 4, Assignment 2: Difference in gradient calculation for the last-layer activation in neural networks

What happens at every layer is fundamentally the same from a mathematical point of view: you are computing the gradients of the parameters (W^{[l]} and b^{[l]}), which are the derivatives of the cost J w.r.t. those parameters. The reason the formulas look different at the output layer is just that we know the activation function there, so we can compute the derivatives explicitly and the chain-rule factors collapse into a closed form. Here’s a thread which goes through those derivations. For the hidden layers, we are applying the Chain Rule, as you say, but we don’t know the activation function in advance, because you have a choice of activation at the hidden layers, so the derivative has to stay symbolic as g^{[l]'}(Z^{[l]}).
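To make that concrete, here is a minimal NumPy sketch (not the assignment's own code): it assumes a one-hidden-layer network with ReLU hidden units and a sigmoid output paired with the cross-entropy cost, using the course convention of rows = units, columns = examples. All variable names (`dZ2`, `dA1`, etc.) are illustrative.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

np.random.seed(0)
m = 4                               # number of examples
X = np.random.randn(3, m)           # 3 input features per example
Y = (np.random.rand(1, m) > 0.5).astype(float)

W1 = np.random.randn(2, 3) * 0.01   # hidden layer: 2 units
b1 = np.zeros((2, 1))
W2 = np.random.randn(1, 2) * 0.01   # output layer: 1 unit
b2 = np.zeros((1, 1))

# Forward pass: ReLU hidden layer, sigmoid output.
Z1 = W1 @ X + b1
A1 = np.maximum(0, Z1)              # ReLU
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)

# Output layer: the activation is KNOWN (sigmoid) and is paired with the
# cross-entropy cost, so dJ/dZ2 collapses to the closed form A2 - Y.
dZ2 = A2 - Y
dW2 = (dZ2 @ A1.T) / m
db2 = dZ2.sum(axis=1, keepdims=True) / m

# Hidden layer: generic Chain Rule, dZ1 = dA1 * g'(Z1), where g' depends on
# whichever activation you chose (here ReLU, so g'(Z1) = 1 where Z1 > 0).
dA1 = W2.T @ dZ2
dZ1 = dA1 * (Z1 > 0)
dW1 = (dZ1 @ X.T) / m
db1 = dZ1.sum(axis=1, keepdims=True) / m

print(dW2.shape, dW1.shape)         # (1, 2) (2, 3) — each matches its W
```

The closed form `dZ2 = A2 - Y` is not a different rule: it is what the same Chain Rule produces when you multiply dJ/dA2 by sigmoid'(Z2) = A2(1 - A2) and the terms cancel. At the hidden layers no such cancellation is available, so the generic `dA * g'(Z)` form remains.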

This course has been specifically designed not to require knowledge of even univariate calculus, let alone matrix calculus, in order to understand the material. That means that Prof Ng does not really show the derivations of the various back-prop formulas, but alludes to the Chain Rule and waves his hands. If you have the math background to understand these issues, here’s a thread with links to various materials dealing with the derivation of back propagation.