Hi,
One of the videos mentions that using ReLU as the activation function instead of sigmoid gives faster convergence in gradient descent. I'm unable to get an intuitive understanding of this point.
Our goal in GD is to find the set of parameters for which the cost function is at its minimum. How do activation functions come into the picture here? Kindly explain, or point me to resources that would be helpful.
From Prof Andrew Ng’s lecture, we have
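roughly the following (paraphrasing from memory the slide for the hidden layer of a two-layer network, single training example, so the exact notation on the slide may differ slightly):

$$
dz^{[1]} = W^{[2]T} dz^{[2]} * g^{[1]\prime}(z^{[1]}), \qquad
dW^{[1]} = dz^{[1]} x^{T}, \qquad
db^{[1]} = dz^{[1]}
$$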
Here you see that dW and db depend on dz, which in turn includes a multiplication by the derivative of the activation function at z, i.e. g'(z).
Since the derivative of the sigmoid is at most 0.25, while for ReLU it is either 0 or 1, convergence is faster with ReLU: the weights and biases get a stronger signal for how to update. The sigmoid derivative also vanishes for inputs far from zero, which means that even with a large learning rate, the parameter updates (learning rate times dW and db) remain very weak compared to ReLU.
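If it helps to see the numbers, here is a small NumPy sketch (my own, not from the course) that prints the factor g'(z) multiplying dz, and therefore dW and db, for a few pre-activation values:

```python
import numpy as np

def sigmoid_grad(z):
    """Derivative of sigmoid: sigma(z) * (1 - sigma(z)), at most 0.25 (reached at z = 0)."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for z > 0, 0 otherwise."""
    return np.where(z > 0, 1.0, 0.0)

# The factor g'(z) that multiplies dz (and hence dW and db) in backprop
z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print("z:           ", z)
print("sigmoid'(z): ", np.round(sigmoid_grad(z), 4))  # shrinks toward 0 away from z = 0
print("relu'(z):    ", relu_grad(z))                  # stays 1 for every positive z
```

Already at |z| = 5 the sigmoid derivative is below 0.01, so dW and db are scaled down by a factor of 100 or more, whereas ReLU passes the gradient through unchanged for any positive z.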