In the intro video, Andrew says the sigmoid function makes Gradient Descent converge slowly because it has almost no slope at the extremes.
But the sigmoid is an activation function, so shouldn't Gradient Descent be applied to the error function instead?
Thanks for clarifying this!
Forward and backward propagation use the entire model structure, including the linear (affine) part and the activation function. You will learn about these terms soon in that course.
Yes, the gradients are the derivatives of the loss (cost) function. But we take the derivatives with respect to each of the parameters of the model (all the W and b values), which means we are differentiating a composite function that includes all the intermediate functions (the affine functions and activation functions at every layer) plus the loss and cost at the very last step.
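To make the composite-function idea concrete, here is a minimal toy sketch (my own example, not course code) of a single sigmoid unit with a squared-error loss, where the chain-rule gradient with respect to the weight is checked against a finite-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy one-unit "network": loss L(w, b) = (sigmoid(w*x + b) - y)^2
x, y = 1.5, 1.0
w, b = 0.3, -0.2

# Forward pass through the composite function
z = w * x + b          # affine (linear) part
a = sigmoid(z)         # activation
L = (a - y) ** 2       # loss

# Backward pass: chain rule dL/dw = dL/da * da/dz * dz/dw
dL_da = 2 * (a - y)
da_dz = a * (1 - a)    # derivative of the sigmoid
dz_dw = x
dL_dw = dL_da * da_dz * dz_dw

# Sanity check with a finite-difference approximation
eps = 1e-6
L_plus = (sigmoid((w + eps) * x + b) - y) ** 2
print(dL_dw, (L_plus - L) / eps)   # the two values should nearly match
```

A real network just repeats this pattern layer by layer, so the chain rule threads through every affine function and activation on the way back from the loss.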
Of course taking the derivative of a composite function involves the Chain Rule, which means you are taking the product of the derivatives at each layer. If one of the factors in that product is very close to zero, it shrinks the magnitude (absolute value) of the whole product.
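As a rough numerical illustration (again a toy sketch, not from the course), the sigmoid derivative sigmoid(z) * (1 - sigmoid(z)) is at most 0.25 and nearly zero for large |z|, so a product of such factors, one per saturated layer, shrinks very quickly:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

# The sigmoid derivative is tiny at the extremes (the "flat" regions)
for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"sigmoid'({z:>4}) = {sigmoid_prime(z):.6f}")

# A product of several such factors (one per layer) shrinks fast,
# which is the slow-convergence / vanishing-gradient effect described above.
factors = [sigmoid_prime(z) for z in [3.0, 4.0, 5.0, 6.0]]
print("product of 4 saturated-layer factors:", np.prod(factors))
```

That is why Andrew points at the flat extremes of the sigmoid: even though Gradient Descent is applied to the cost, the activation's derivative shows up as a factor in every gradient, and when it is near zero the updates become tiny and convergence slows down.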