I’m currently working on implementing a neural network using the sigmoid activation function and the binary cross-entropy cost function. In my implementation, I’ve noticed that the gradient calculation for the last layer’s activation differs from the one used for the other layers. I’m seeking a clear explanation or proof for this discrepancy.
Specifically, I’m curious why the gradient calculation for the last layer is different and whether it’s influenced by the choice of cost function and activation function. I would also appreciate insight into why we can’t simply use the same Chain Rule step, dA^{[l-1]} = \frac{\partial J}{\partial A^{[l-1]}} = \frac{\partial J}{\partial Z^{[l]}} \cdot \frac{\partial Z^{[l]}}{\partial A^{[l-1]}}, i.e. np.dot(W.T, dZ) in code, for the gradients at every layer. In other words, for the last layer we don’t find dA^{[L]} using np.dot(W.T, dZ); we compute it with a different approach, and I’m eager to understand the rationale behind that choice.
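For concreteness, here is a minimal sketch of the two computations I’m comparing (the names AL, Y, W, dZ follow the course’s NumPy notation; this is only an illustration of my question, not my full implementation):

```python
import numpy as np

# Last layer: dA^[L] comes directly from the binary cross-entropy cost
# and the labels Y; there is no later layer to propagate it from.
def output_layer_dAL(AL, Y):
    return -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

# Hidden layer l: dA^[l-1] is propagated back from layer l through its
# weights, i.e. dA^[l-1] = W^[l].T @ dZ^[l].
def hidden_layer_dA_prev(W, dZ):
    return np.dot(W.T, dZ)
```

My question is essentially why the first computation can’t take the same form as the second.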
Any explanations, proofs, or insights into the reasons behind these gradient calculations would be greatly appreciated. Thank you for your help and guidance.
Conceptually, the equation for the output layer is different because that is where we have labeled data, so we can compute the output error term directly. We don’t have labels for the hidden layers, which is why “backpropagation of errors” is needed.
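To put that in equations, one standard way to write the backprop recurrence (using dZ^{[l]} for the “error” term at layer l and g^{[l]} for that layer’s activation) is

$$dZ^{[L]} = \frac{\partial J}{\partial A^{[L]}} \odot g^{[L]\prime}\!\left(Z^{[L]}\right)$$

which is computed directly from the cost and the labels Y, and

$$dZ^{[l]} = \left( W^{[l+1]T}\, dZ^{[l+1]} \right) \odot g^{[l]\prime}\!\left(Z^{[l]}\right) \quad \text{for } l < L$$

where there are no labels, so the error has to be propagated backwards from layer l+1. The first line is where the “different approach” comes from: there is no layer L+1 above the output, so the only source of gradient information there is the cost itself.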
For the mathematical details, you can find many discussions of backpropagation in DLS Course 1 by using the forum search tool with the keyword “backpropagation”.
What happens at every layer is fundamentally the same from a mathematical point of view: you are computing the gradients of the parameters (W^{[l]} and b^{[l]}), which are the derivatives of the cost J w.r.t. those parameters. The reason the formulas look different at the output layer is just that we know the activation function there (and the cost function), so we can compute those derivatives explicitly. Here’s a thread which goes through those derivations. For the hidden layers, we are applying the Chain Rule, as you say, but we can’t fill in the derivative of the activation in advance, because you have a choice of activation function at the hidden layers.
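As a quick sketch of what that looks like in your specific case (sigmoid output with binary cross-entropy), working with a single example so the 1/m averaging can be set aside: write a = \sigma(z) for the output activation and \mathcal{L} = -\big(y \log a + (1-y)\log(1-a)\big) for the loss. Then

$$\frac{\partial \mathcal{L}}{\partial a} = -\left(\frac{y}{a} - \frac{1-y}{1-a}\right) = \frac{a-y}{a(1-a)}, \qquad \frac{\partial a}{\partial z} = \sigma'(z) = a(1-a)$$

so the Chain Rule gives

$$\frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial a} \cdot \frac{\partial a}{\partial z} = \frac{a-y}{a(1-a)} \cdot a(1-a) = a - y$$

That is why dZ^{[L]} can be written explicitly as A^{[L]} - Y for this particular activation/cost pairing, whereas a hidden layer has to stay with the generic dZ^{[l]} = dA^{[l]} \odot g^{[l]\prime}(Z^{[l]}) until you decide which activation you are using there.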
This course has been specifically designed not to require knowledge of even univariate calculus, let alone matrix calculus, in order to understand the material. That means that Prof Ng does not really show the derivations of the various backprop formulas, but alludes to the Chain Rule and waves his hands. If you have the math background to understand these issues, here’s a thread with links to various materials dealing with the derivation of backpropagation.