Why is there a need to decrease the learning rate over time? Won't gradient descent automatically take smaller steps, because the slope decreases as we go down the cost function curve, i.e. as we move towards the minimum?
Hello @Aryan06,
Let's look at this slide from one of the Week 2 videos:
A very common drawback of switching from batch GD to mini-batch GD is that the cost will oscillate. Overall the cost will decrease, but the oscillation won't disappear just because we are approaching the minimum, because the size of the oscillation depends on the mini-batch size: the smaller the mini-batch, the larger the oscillation we are likely to run into. Such oscillation is bad for us because it keeps the model from really converging; instead, the model wanders around the minimum.
To overcome this, we decrease the learning rate over time so that, by the time the model is close to the minimum, the learning rate has hopefully been reduced enough to effectively kill off the oscillation: if the learning rate is small, the weight update step is small, and therefore the change in the cost should also be small.
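In code, the idea could look like the minimal sketch below. It assumes a simple 1 / (1 + decay_rate * epoch) schedule and a toy least-squares cost; the function names, defaults, and cost are illustrative assumptions, not the exact code from the course.

```python
import numpy as np

def compute_gradient(w, X_batch, y_batch):
    """Gradient of a toy least-squares cost on one mini-batch (illustrative)."""
    predictions = X_batch @ w
    return X_batch.T @ (predictions - y_batch) / len(y_batch)

def train(X, y, alpha0=0.1, decay_rate=1.0, num_epochs=50, batch_size=32):
    w = np.zeros(X.shape[1])
    for epoch in range(num_epochs):
        # Decay the learning rate once per epoch so that, near the minimum,
        # the update steps become small enough to damp the oscillation.
        alpha = alpha0 / (1 + decay_rate * epoch)

        # Shuffle the training set and loop over mini-batches.
        permutation = np.random.permutation(len(y))
        for start in range(0, len(y), batch_size):
            batch_idx = permutation[start:start + batch_size]
            grad = compute_gradient(w, X[batch_idx], y[batch_idx])
            w -= alpha * grad  # smaller alpha => smaller step => smaller cost change
    return w
```

The key point is only the line computing `alpha`: early epochs take large steps to make fast progress, while later epochs take progressively smaller steps so the mini-batch noise no longer pushes the weights far from the minimum.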
Cheers,
Raymond
Thank you @rmwkwok for the explanation
You are welcome @Aryan06!