Do we need to use a learning rate scheduler with adaptive optimizers like Adam and AdaGrad?

I searched for this question online and came across a blog post, "A brief history of learning rate schedulers and adaptive optimizers", which says that we do not need a learning rate scheduler with optimizers like Adam. However, Prof. Ng said in a video that reducing the learning rate over time may help speed up learning.

I’d like to ask the community to share some thoughts on this topic.


You will explore this question in exercise 7 of this week’s assignment.

In particular, you’ll see how learning rate decay scheduling allows Adam to achieve a similar accuracy faster.
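
To make the idea concrete, here is a minimal NumPy sketch of Adam with a decaying base learning rate. It is not the assignment's code: the function names (`decayed_learning_rate`, `adam_minimize`), the toy quadratic objective, the hyperparameter values, and the inverse time decay formula `lr0 / (1 + decay_rate * epoch)` are all assumptions chosen for illustration.

```python
import numpy as np


def decayed_learning_rate(lr0, epoch, decay_rate):
    """Inverse time decay (assumed schedule): shrink lr0 as epochs pass."""
    return lr0 / (1.0 + decay_rate * epoch)


def adam_minimize(grad_fn, w0, lr0=0.1, decay_rate=0.0, epochs=200,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam for `epochs` steps; decay_rate=0.0 disables the schedule."""
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)  # first-moment (mean) estimate of the gradient
    v = np.zeros_like(w)  # second-moment (uncentered variance) estimate
    for t in range(1, epochs + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias correction for m
        v_hat = v / (1 - beta2 ** t)  # bias correction for v
        # The scheduler only rescales the base rate; Adam still adapts
        # the per-parameter step through m_hat and v_hat.
        lr = decayed_learning_rate(lr0, t - 1, decay_rate)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w


def toy_grad(w):
    """Gradient of the toy objective f(w) = sum(w ** 2)."""
    return 2.0 * w


w_start = np.array([1.0, -2.0])
w_plain = adam_minimize(toy_grad, w_start, decay_rate=0.0)
w_decay = adam_minimize(toy_grad, w_start, decay_rate=0.01)
print("no decay:  ", w_plain)
print("with decay:", w_decay)
```

Running both variants on your own problem and comparing the loss curves is the simplest way to check whether the schedule helps there.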

As always, remember that what works best may be problem-specific.

