Adaptive Learning Rates

I wonder what are optimizers with adaptive learning rates? Are there any drawbacks for using them that we should be aware of? If there isn’t any, why are we still using old-school optimizers with strict learning rate parameters? When to use an optimizer with adaptive learning rate like Adam or a classic optimizer with pre-defined learning rate like SGD? If optimizers with adaptive learning rate automates the parameter tuning process, is it true to assume that the model does not need a fine tuning for its learning rate hyperparameter, if it is the case why are we still passing a learning rate value to those optimizers, does it make a difference at all?

The Adam optimizer uses a variable learning rate.

The drawback to the traditional fixed-rate gradient descent is that it never really reaches the minimum, because (alpha * gradient) only approaches zero, but never gets there. It’s also very slow and computationally inefficient.

Old school optimizers are simple to teach in introductory ML courses. In practice, ML doesn’t really much get into how optimizers work, because that’s more of a mathematical art than a machine-learning art.

I believe that you still have to feed the optimizer with an initial learning rate. The optimizer takes over from there.

Hi @makelojan,

I think we generally use optimizers with adaptive learning rates, and as you have pointed out, Adam is one such example. We don’t generally use the vanilla optimizer w: = w - \alpha\frac{\partial{J}}{\partial{w}} but only teach it in early classes as introduction. From vanilla, we can go adaptive, or we can go momentum-based, or we can go both, so starting from vanilla is a good thing to see the potential improvement by those two ideas.

There is optimizer that is designed to not need any learning rate, such as AdaDelta, but the choice of optimizer isn’t about whether we do or don’t need a manual learning rate. In fact, in many implementations of AdaDelta, developers still added learning rate.

With the learning rate parameter, we have one more control. With being adaptive, it adjusts automatically.