Week 4, Assignment 2: Difference in gradient calculation for the last layer activation in neural networks

paulinpaloalto · May 17, 2023, 6:50pm

What happens at every layer is fundamentally the same from a mathematical point of view. You are computing the gradients for the parameters (W^{[l]} and b^{[l]}), which are the derivatives of the cost J w.r.t. those parameters. The reason the formulas look different at the output layer is just that we know the activation function there, so we can compute the derivatives explicitly. Here’s a thread which goes through those derivations. For the hidden layers, we are applying the Chain Rule, as you say, but the activation function is not fixed, because you have a choice at the hidden layers, so the formula has to stay in its general form in terms of g'.
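If it helps to see this numerically, here is a small sketch, assuming the output layer uses sigmoid with the cross-entropy loss (the usual setup for binary classification in this course). It compares the general Chain Rule form dZ = dA * g'(Z) with the output-layer shortcut dZ = A - Y on a few scalar values. The function names and sample values are made up for illustration and are not taken from the assignment code.

```python
import math


def sigmoid(z):
    """Sigmoid activation for a single scalar z."""
    return 1.0 / (1.0 + math.exp(-z))


def dz_general(a, y):
    """General Chain Rule form: dZ = dA * g'(Z).

    Assumes cross-entropy loss, so dA = -(y / a) + (1 - y) / (1 - a),
    and sigmoid activation, so g'(Z) = a * (1 - a) with a = sigmoid(z).
    """
    da = -(y / a) + (1.0 - y) / (1.0 - a)
    g_prime = a * (1.0 - a)
    return da * g_prime


def dz_shortcut(a, y):
    """Output-layer shortcut: dZ = A - Y (valid for sigmoid + cross-entropy)."""
    return a - y


# Hypothetical pre-activations and labels, just for the comparison.
zs = [-2.0, -0.5, 0.0, 1.5, 3.0]
ys = [0, 1, 1, 0, 1]

for z, y in zip(zs, ys):
    a = sigmoid(z)
    print(z, y, dz_general(a, y), dz_shortcut(a, y))
```

The two columns should agree up to floating point rounding, because multiplying out dA * a(1 - a) cancels the denominators and leaves a - y. With a different output activation or loss, the shortcut would no longer apply and only the general form would hold.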

This course has been specifically designed not to require knowledge of even univariate calculus, let alone matrix calculus, in order to understand the material. That means that Prof Ng does not really show the derivations of the various back prop formulas but alludes to the Chain Rule and waves his hands. If you have the math background to understand these issues, here’s a thread with links to various material dealing with the derivation of back propagation.
