Adaptive Learning Rates

makelojan · October 30, 2023, 6:24pm

What are optimizers with adaptive learning rates? Are there any drawbacks to using them that we should be aware of? If there aren’t any, why are we still using old-school optimizers with fixed learning rate parameters? When should we use an optimizer with an adaptive learning rate, like Adam, and when a classic optimizer with a predefined learning rate, like SGD? If optimizers with adaptive learning rates automate the parameter tuning process, is it fair to assume that the model’s learning rate hyperparameter does not need tuning? If so, why are we still passing a learning rate value to those optimizers? Does it make a difference at all?

TMosh · October 30, 2023, 10:36pm

The Adam optimizer uses a variable learning rate.

A drawback of traditional fixed-rate gradient descent is that it never quite reaches the minimum: the update (alpha * gradient) approaches zero but never gets there. It can also be very slow and computationally inefficient.
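As a rough illustration of that point, here is a minimal sketch of fixed-rate gradient descent on a toy cost J(w) = w**2. The cost function, the starting point, and the name `fixed_rate_descent` are assumptions chosen for the example, not something from the course material.

```python
def fixed_rate_descent(w0, alpha, steps):
    """Run fixed-rate gradient descent on J(w) = w**2 and return every visited w."""
    w = w0
    history = [w]
    for _ in range(steps):
        grad = 2.0 * w          # dJ/dw for the toy cost J(w) = w**2
        w = w - alpha * grad    # the update shrinks as the gradient shrinks
        history.append(w)
    return history


history = fixed_rate_descent(w0=1.0, alpha=0.1, steps=50)
print(history[1], history[10], history[-1])
```

On this cost each step multiplies w by (1 - 2 * alpha), so w keeps getting closer to the minimum at 0 while each step gets smaller than the one before.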

Old-school optimizers are simple to teach in introductory ML courses. In practice, ML work doesn’t get much into how optimizers work, because that’s more of a mathematical art than a machine-learning art.

I believe that you still have to feed the optimizer with an initial learning rate. The optimizer takes over from there.

rmwkwok · October 31, 2023, 1:50am

Hi @makelojan,

I think we generally use optimizers with adaptive learning rates, and as you have pointed out, Adam is one such example. We don’t generally use the vanilla optimizer $w := w - \alpha \frac{\partial J}{\partial w}$, but only teach it in early classes as an introduction. From vanilla, we can go adaptive, or momentum-based, or both, so starting from vanilla is a good way to see the potential improvement from those two ideas.
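To make the comparison concrete, here is a minimal single-parameter sketch of the four variants: vanilla, momentum-based, adaptive (written in the RMSProp style), and both combined (written in the Adam style). The function names and default values are assumptions for illustration, and real library implementations differ in details.

```python
import math


def sgd_step(w, grad, lr):
    """Vanilla update: w := w - lr * grad."""
    return w - lr * grad


def momentum_step(w, grad, velocity, lr, beta=0.9):
    """Momentum: step along a running sum of past gradients."""
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity


def rmsprop_step(w, grad, sq_avg, lr, beta=0.9, eps=1e-8):
    """Adaptive: divide the step by a running RMS of the gradient."""
    sq_avg = beta * sq_avg + (1 - beta) * grad ** 2
    return w - lr * grad / (math.sqrt(sq_avg) + eps), sq_avg


def adam_step(w, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Both ideas: momentum-style average m, adaptive scaling by v (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction for the zero-initialised averages
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


# One step from w = 1.0 on the toy cost J(w) = w**2, whose gradient is 2 * w.
w_sgd = sgd_step(1.0, 2.0, lr=0.1)
w_mom, vel = momentum_step(1.0, 2.0, velocity=0.0, lr=0.1)
w_rms, sq = rmsprop_step(1.0, 2.0, sq_avg=0.0, lr=0.1)
w_adam, m, v = adam_step(1.0, 2.0, m=0.0, v=0.0, t=1, lr=0.1)
print(w_sgd, w_mom, w_rms, w_adam)
```

Every variant here still takes `lr`. In the Adam-style rule the very first step has a size of roughly `lr` regardless of the gradient’s magnitude, so the value passed in still sets the overall scale of the updates.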

There are optimizers designed not to need any learning rate, such as AdaDelta, but the choice of optimizer isn’t about whether we do or don’t need a manual learning rate. In fact, in many implementations of AdaDelta, developers have still added a learning rate.
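For reference, a minimal single-parameter sketch of an AdaDelta-style update is below. The step size comes from the ratio of two running averages, and `lr` is only an extra multiplier, set to 1.0 here so that it has no effect. The function name, argument names, and default values are assumptions for illustration, not any particular library’s API.

```python
import math


def adadelta_step(w, grad, sq_grad_avg, sq_delta_avg, rho=0.95, eps=1e-6, lr=1.0):
    """One AdaDelta-style update for a single parameter."""
    # Running average of squared gradients.
    sq_grad_avg = rho * sq_grad_avg + (1 - rho) * grad ** 2
    # Step scaled by RMS of past updates over RMS of gradients; no alpha involved.
    delta = -math.sqrt(sq_delta_avg + eps) / math.sqrt(sq_grad_avg + eps) * grad
    # Running average of squared updates, used by the next step.
    sq_delta_avg = rho * sq_delta_avg + (1 - rho) * delta ** 2
    # lr only rescales the step that the rule has already computed.
    return w + lr * delta, sq_grad_avg, sq_delta_avg


# One step from w = 1.0 on the toy cost J(w) = w**2 (gradient 2 * w).
w_full, g_avg, d_avg = adadelta_step(1.0, 2.0, sq_grad_avg=0.0, sq_delta_avg=0.0)
w_half, _, _ = adadelta_step(1.0, 2.0, sq_grad_avg=0.0, sq_delta_avg=0.0, lr=0.5)
print(w_full, w_half)
```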

With the learning rate parameter, we have one more control. Being adaptive, the optimizer adjusts the step automatically.

Cheers,
Raymond
