Thanks for reading my message.
My question is about the implementation of the linear_activation_backward function.
In this assignment, we have multiple ReLU activation functions in the hidden layers and one sigmoid function in the output layer.
There is a function provided for the backward propagation through the sigmoid activation. Since the sigmoid in this case is used only at the output layer (layer L), dZ[L] = A[L] - Y, which is the derivative of the cost with respect to Z[L].
What is the sigmoid_backward function calculating, then?
And why does it need dA as a parameter and the activation_cache (which is Z in this case)?
The point is that sigmoid_backward and relu_backward compute the general chain-rule formula that Raymond shows, dZ[l] = dA[l] * g'(Z[l]), where g is the layer's activation function. Remember that these functions are intended to be general: it is perfectly possible to use sigmoid in the hidden layers as well, although it just happens that we don't do that here.
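For concreteness, here is a minimal sketch of what helpers along these lines typically compute. The exact code provided in the assignment may differ, so treat this as an illustration of the formula rather than the official implementation:

```python
import numpy as np

def sigmoid_backward(dA, activation_cache):
    # Sketch only: dZ = dA * g'(Z) for g = sigmoid.
    Z = activation_cache                 # Z cached during forward prop
    s = 1 / (1 + np.exp(-Z))             # recompute sigmoid(Z)
    return dA * s * (1 - s)              # sigmoid'(Z) = s * (1 - s)

def relu_backward(dA, activation_cache):
    # Sketch only: dZ = dA * g'(Z) for g = ReLU.
    Z = activation_cache
    dZ = np.array(dA, copy=True)         # ReLU'(Z) = 1 where Z > 0
    dZ[Z <= 0] = 0                       # and 0 elsewhere
    return dZ
```

That is also why both functions need dA (the gradient flowing back from the linear step of the next layer) and the cached Z (to evaluate g'(Z)): neither piece on its own determines dZ.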
The formula you show, dZ[L] = A[L] - Y, is a special case: it applies only at the output layer, and it arises because the derivative of the sigmoid has already been folded in together with the derivative of the cross-entropy cost. In these assignments the activation is sigmoid only at the output layer. See the derivation of that on the famous thread from Eddy.
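For anyone who prefers not to chase the link, here is a compact sketch of that derivation (my own write-up, assuming the binary cross-entropy cost used in the assignment; layer superscripts dropped until the last step):

```latex
% Sigmoid output unit with binary cross-entropy loss
\[
a = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
\mathcal{L}(a, y) = -\bigl(y \log a + (1 - y)\log(1 - a)\bigr)
\]
\[
\frac{\partial \mathcal{L}}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a}
  = \frac{a - y}{a(1 - a)}, \qquad
\frac{\partial a}{\partial z} = \sigma'(z) = a(1 - a)
\]
\[
\frac{\partial \mathcal{L}}{\partial z}
  = \frac{\partial \mathcal{L}}{\partial a}\cdot\frac{\partial a}{\partial z}
  = \frac{a - y}{a(1 - a)}\, a(1 - a) = a - y
  \;\;\Longrightarrow\;\; dZ^{[L]} = A^{[L]} - Y
\]
```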
Just thought I would mention that it may be clearer, and may result in fewer questions on this topic, if we show the derivation in the class notes posted in the course material.
Please see the derivation added to the course notes (attached file), which is basically the same as the one in the link provided by @paulinpaloalto.
It may be better to update the class notes and show the derivation there.
Please see attached.