Hi,
One of the videos mentions that using ReLU as the activation function instead of sigmoid gives faster convergence in gradient descent. I'm unable to get an intuitive understanding of this point.
Our goal in GD is to find the set of parameters for which the cost function is at its minimum. How do activation functions come into the picture here? Kindly explain, or point me to resources that would be helpful.
From Prof Andrew Ng’s lecture, we have
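roughly the following (paraphrasing from memory the slide for the hidden layer of a two-layer network, single training example, so the exact notation on the slide may differ slightly):

$$
dz^{[1]} = W^{[2]T} dz^{[2]} * g^{[1]\prime}(z^{[1]}), \qquad
dW^{[1]} = dz^{[1]} x^{T}, \qquad
db^{[1]} = dz^{[1]}
$$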
Here you see that dW and db depend on dz, which in turn includes a multiplication by the derivative of the activation function at z, i.e. g'(z).
Since the derivative of the sigmoid is at most 0.25, while for ReLU it is either 0 or 1, convergence is faster with ReLU: the weights and biases get a stronger signal for how to update. The sigmoid derivative also vanishes for inputs far from zero, which means that even with a large learning rate, the parameter updates (learning rate times dW and db) remain very weak compared to ReLU.
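If it helps to see the numbers, here is a small NumPy sketch (my own, not from the course) that prints the factor g'(z) multiplying dz, and therefore dW and db, for a few pre-activation values:

```python
import numpy as np

def sigmoid_grad(z):
    """Derivative of sigmoid: sigma(z) * (1 - sigma(z)), at most 0.25 (reached at z = 0)."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for z > 0, 0 otherwise."""
    return np.where(z > 0, 1.0, 0.0)

# The factor g'(z) that multiplies dz (and hence dW and db) in backprop
z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print("z:           ", z)
print("sigmoid'(z): ", np.round(sigmoid_grad(z), 4))  # shrinks toward 0 away from z = 0
print("relu'(z):    ", relu_grad(z))                  # stays 1 for every positive z
```

Already at |z| = 5 the sigmoid derivative is below 0.01, so dW and db are scaled down by a factor of 100 or more, whereas ReLU passes the gradient through unchanged for any positive z.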