I’m currently working on implementing a neural network using the sigmoid activation function and the binary cross-entropy cost function. In my implementation, I’ve noticed that the gradient calculation for the last layer’s activation differs from the one used for the other layers. I’m seeking a clear explanation or proof for this discrepancy.
Specifically, I’m curious why the gradient calculation for the last layer is different and whether it’s influenced by the choice of cost function and activation function. I would also appreciate insight into why we can’t simply use the same Chain Rule step, dA^{[l-1]} = \frac{\partial J}{\partial A^{[l-1]}} = \frac{\partial J}{\partial Z^{[l]}} \cdot \frac{\partial Z^{[l]}}{\partial A^{[l-1]}}, i.e. np.dot(W.T, dZ) in code, for the gradients at every layer. In other words, for the last layer we don’t find dA^{[L]} using np.dot(W.T, dZ); we compute it with a different approach, and I’m eager to understand the rationale behind that choice.
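For concreteness, here is a minimal sketch of the two computations I’m comparing (the names AL, Y, W, dZ follow the course’s NumPy notation; this is only an illustration of my question, not my full implementation):

```python
import numpy as np

# Last layer: dA^[L] comes directly from the binary cross-entropy cost
# and the labels Y; there is no later layer to propagate it from.
def output_layer_dAL(AL, Y):
    return -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

# Hidden layer l: dA^[l-1] is propagated back from layer l through its
# weights, i.e. dA^[l-1] = W^[l].T @ dZ^[l].
def hidden_layer_dA_prev(W, dZ):
    return np.dot(W.T, dZ)
```

My question is essentially why the first computation can’t take the same form as the second.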
Any explanations, proofs, or insights into the reasons behind these gradient calculations would be greatly appreciated. Thank you for your help and guidance.
Conceptually, the equation for the output layer is different because that is where we have labeled data, so we can compute the output error term directly. We don’t have labels for the hidden layers, which is why “backpropagation of errors” is needed.
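To put that in equations, one standard way to write the backprop recurrence (using dZ^{[l]} for the “error” term at layer l and g^{[l]} for that layer’s activation) is

$$dZ^{[L]} = \frac{\partial J}{\partial A^{[L]}} \odot g^{[L]\prime}\!\left(Z^{[L]}\right)$$

which is computed directly from the cost and the labels Y, and

$$dZ^{[l]} = \left( W^{[l+1]T}\, dZ^{[l+1]} \right) \odot g^{[l]\prime}\!\left(Z^{[l]}\right) \quad \text{for } l < L$$

where there are no labels, so the error has to be propagated backwards from layer l+1. The first line is where the “different approach” comes from: there is no layer L+1 above the output, so the only source of gradient information there is the cost itself.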
For the mathematical details, you can find many discussions of backpropagation in DLS Course 1 by using the forum search tool with the keyword “backpropagation”.
What happens at every layer is fundamentally the same from a mathematical point of view: you are computing the gradients of the parameters (W^{[l]} and b^{[l]}), which are the derivatives of the cost J w.r.t. those parameters. The reason the formulas look different at the output layer is just that we know the activation function there (and the cost function), so we can compute those derivatives explicitly. Here’s a thread which goes through those derivations. For the hidden layers, we are applying the Chain Rule, as you say, but we can’t fill in the derivative of the activation in advance, because you have a choice of activation function at the hidden layers.
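As a quick sketch of what that looks like in your specific case (sigmoid output with binary cross-entropy), working with a single example so the 1/m averaging can be set aside: write a = \sigma(z) for the output activation and \mathcal{L} = -\big(y \log a + (1-y)\log(1-a)\big) for the loss. Then

$$\frac{\partial \mathcal{L}}{\partial a} = -\left(\frac{y}{a} - \frac{1-y}{1-a}\right) = \frac{a-y}{a(1-a)}, \qquad \frac{\partial a}{\partial z} = \sigma'(z) = a(1-a)$$

so the Chain Rule gives

$$\frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial a} \cdot \frac{\partial a}{\partial z} = \frac{a-y}{a(1-a)} \cdot a(1-a) = a - y$$

That is why dZ^{[L]} can be written explicitly as A^{[L]} - Y for this particular activation/cost pairing, whereas a hidden layer has to stay with the generic dZ^{[l]} = dA^{[l]} \odot g^{[l]\prime}(Z^{[l]}) until you decide which activation you are using there.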
This course has been specifically designed not to require knowledge of even univariate calculus, let alone matrix calculus, in order to understand the material. That means that Prof Ng does not really show the derivations of the various backprop formulas, but alludes to the Chain Rule and waves his hands. If you have the math background to understand these issues, here’s a thread with links to various materials dealing with the derivation of backpropagation.