In the intro video, Andrew says the sigmoid function makes Gradient Descent converge slowly because it has almost no slope at the extremes.
But the sigmoid is an activation function, so shouldn't Gradient Descent be applied to the error function instead?
Thanks for clarifying this!
Forward and backward propagation use the entire model structure, including the linear (affine) part and the activation function. You will learn about these terms soon in that course.
Yes, the gradients are the derivatives of the loss (cost) function. But we take the derivatives with respect to each of the parameters of the model (all the W and b values), which means we are differentiating a composite function that includes all the intermediate functions (the affine functions and activation functions at every layer) plus the loss and cost at the very last step.
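To make the composite-function idea concrete, here is a minimal toy sketch (my own example, not course code) of a single sigmoid unit with a squared-error loss, where the chain-rule gradient with respect to the weight is checked against a finite-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy one-unit "network": loss L(w, b) = (sigmoid(w*x + b) - y)^2
x, y = 1.5, 1.0
w, b = 0.3, -0.2

# Forward pass through the composite function
z = w * x + b          # affine (linear) part
a = sigmoid(z)         # activation
L = (a - y) ** 2       # loss

# Backward pass: chain rule dL/dw = dL/da * da/dz * dz/dw
dL_da = 2 * (a - y)
da_dz = a * (1 - a)    # derivative of the sigmoid
dz_dw = x
dL_dw = dL_da * da_dz * dz_dw

# Sanity check with a finite-difference approximation
eps = 1e-6
L_plus = (sigmoid((w + eps) * x + b) - y) ** 2
print(dL_dw, (L_plus - L) / eps)   # the two values should nearly match
```

A real network just repeats this pattern layer by layer, so the chain rule threads through every affine function and activation on the way back from the loss.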
Of course taking the derivative of a composite function involves the Chain Rule, which means you are taking the product of the derivatives at each layer. If one of the factors in that product is very close to zero, it shrinks the magnitude (absolute value) of the whole product.
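As a rough numerical illustration (again a toy sketch, not from the course), the sigmoid derivative sigmoid(z) * (1 - sigmoid(z)) is at most 0.25 and nearly zero for large |z|, so a product of such factors, one per saturated layer, shrinks very quickly:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

# The sigmoid derivative is tiny at the extremes (the "flat" regions)
for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"sigmoid'({z:>4}) = {sigmoid_prime(z):.6f}")

# A product of several such factors (one per layer) shrinks fast,
# which is the slow-convergence / vanishing-gradient effect described above.
factors = [sigmoid_prime(z) for z in [3.0, 4.0, 5.0, 6.0]]
print("product of 4 saturated-layer factors:", np.prod(factors))
```

That is why Andrew points at the flat extremes of the sigmoid: even though Gradient Descent is applied to the cost, the activation's derivative shows up as a factor in every gradient, and when it is near zero the updates become tiny and convergence slows down.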