The point is that sigmoid_backward and relu_backward are calculating the formula that Raymond shows: the general chain-rule step dZ = dA * g'(Z). Remember that these functions are intended to be general: it is perfectly possible to use sigmoid in the hidden layers as well, although in these assignments we just happen not to.
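To make that concrete, here is a minimal sketch of what such backward helpers typically compute. The cache layout (Z saved during forward propagation) is an assumption on my part; the course's actual utility code may differ in detail:

```python
import numpy as np

def sigmoid_backward(dA, cache):
    """Compute dZ = dA * g'(Z) for the sigmoid activation.

    cache is assumed to hold the Z saved during forward prop
    (a sketch, not necessarily the course's exact code).
    """
    Z = cache
    s = 1.0 / (1.0 + np.exp(-Z))   # sigmoid(Z)
    return dA * s * (1 - s)        # sigmoid'(Z) = s * (1 - s)

def relu_backward(dA, cache):
    """Compute dZ = dA * g'(Z) for the ReLU activation."""
    Z = cache
    dZ = np.array(dA, copy=True)   # relu'(Z) = 1 where Z > 0
    dZ[Z <= 0] = 0                 # and 0 elsewhere
    return dZ
```

Note that both functions implement the same pattern: multiply the incoming gradient dA elementwise by the derivative of the layer's activation function.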
The formula you show, dZ = AL - Y, is a special case: it applies only at the output layer, and it arises because the derivative of sigmoid has already been folded into the derivative of the cross-entropy cost. In general, sigmoid is the activation only at the output layer here. See the derivation of that on the famous thread from Eddy.
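For reference, here is a condensed version of that derivation for a single sample, assuming the binary cross-entropy loss with sigmoid output a = sigma(z):

```latex
% Binary cross-entropy loss with sigmoid output a = \sigma(z)
\mathcal{L}(a, y) = -\,y \log a - (1 - y)\log(1 - a)

% Derivative of the loss with respect to a
\frac{\partial \mathcal{L}}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a}

% Sigmoid derivative: \sigma'(z) = a(1 - a); apply the chain rule
\frac{\partial \mathcal{L}}{\partial z}
  = \left(-\frac{y}{a} + \frac{1 - y}{1 - a}\right) a(1 - a)
  = -\,y(1 - a) + (1 - y)\,a
  = a - y
```

Vectorized over the whole output layer and all samples, that last line is exactly dZ = AL - Y, which is why the assignment can skip the separate sigmoid_backward step there.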