Why not always use Adam optimizer

Hello World,

Considering that Adam uses both the Momentum and RMSProp ideas in its implementation, why not always use the Adam optimizer? In what scenarios would one use Momentum or RMSProp instead of Adam?

Hi there,

here is my take on this matter:

  • Momentum accelerates your search by "using the momentum" of past gradients to carry the update over local minima so that it does not get stuck there
  • RMSProp essentially damps the search in directions that oscillate
  • Adam combines the heuristics of both Momentum and RMSProp (see the sketch after this list), as pointed out in this nice article:
    Source: Intro to optimization in deep learning: Momentum, RMSProp and Adam
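
To make the combination concrete, here is a minimal NumPy sketch of the three update rules. It is purely illustrative and not tied to any framework; the hyperparameter names (`lr`, `beta`, `beta1`, `beta2`, `eps`) and their defaults just follow the usual conventions.

```python
import numpy as np

def momentum_step(w, grad, v, lr=0.01, beta=0.9):
    """Momentum: accumulate an exponentially weighted average of past gradients
    and move along it, which helps the search roll over small local minima."""
    v = beta * v + (1 - beta) * grad
    return w - lr * v, v

def rmsprop_step(w, grad, s, lr=0.01, beta=0.9, eps=1e-8):
    """RMSProp: divide by a running RMS of the gradient, which damps the
    step size in directions that oscillate."""
    s = beta * s + (1 - beta) * grad**2
    return w - lr * grad / (np.sqrt(s) + eps), s

def adam_step(w, grad, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: the Momentum-style first moment combined with the RMSProp-style
    second moment, plus bias correction for the first few steps."""
    v = beta1 * v + (1 - beta1) * grad         # first moment (Momentum part)
    s = beta2 * s + (1 - beta2) * grad**2      # second moment (RMSProp part)
    v_hat = v / (1 - beta1**t)                 # bias correction, t starts at 1
    s_hat = s / (1 - beta2**t)
    return w - lr * v_hat / (np.sqrt(s_hat) + eps), v, s

# toy usage: minimize f(w) = w^2 with Adam
w, v, s = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 201):
    w, v, s = adam_step(w, 2 * w, v, s, t, lr=0.1)
print(w)  # close to 0
```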

So different "cost spaces" call for different numerical approaches to find an acceptable solution as fast as possible. I believe it's fair to say that Adam is a good optimizer to start with, but based on the performance of your optimization you need to check whether it actually fulfils your requirements in terms of your metrics; see also this thread for some discussion on KPIs to track and evaluate: Underfitting and Overfitting - #2 by Christian_Simonis
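
As a rough illustration of "checking your metrics", one could train the same model with several optimizers and compare a validation metric. This is just a sketch: `build_model`, `x_train`, `y_train`, `x_val` and `y_val` are hypothetical stand-ins for your own architecture and data.

```python
import tensorflow as tf

def build_model():
    # placeholder architecture; replace with your own model
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])

candidates = {
    "sgd_momentum": tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    "rmsprop": tf.keras.optimizers.RMSprop(learning_rate=0.001),
    "adam": tf.keras.optimizers.Adam(learning_rate=0.001),
}

results = {}
for name, opt in candidates.items():
    model = build_model()
    model.compile(optimizer=opt, loss="mse", metrics=["mae"])
    history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                        epochs=20, verbose=0)
    results[name] = min(history.history["val_loss"])

print(results)  # pick whichever optimizer meets your requirements/KPIs
```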

In general, I have personally also had good experiences with Adam, as it possesses the favourable characteristics mentioned above.

Side note: saddle points can often be an issue in high-dimensional spaces. If you are interested in more detail, feel free to take a look at this paper from 2014: https://arxiv.org/pdf/1406.2572.pdf

Best regards
Christian

Hi, in addition to @Christian_Simonis' comments I would like to add a few more points about why Adam is not always the best solution.

  1. Adam maintains moving averages of the gradient and its square (the first and second moments), which means it can take longer to converge than other optimizers. This may not be a problem in many cases, but for tasks with a large number of parameters or very small data sets, Adam may be too slow.
  2. Adam is sensitive to the scale of the gradients, so it is important to scale your data before training a model with Adam (see the sketch below). If the inputs are left unscaled, the gradient magnitudes can vary a lot and Adam may have trouble converging.
  3. Adam can also be sensitive to the choice of hyperparameters. It is important to tune the learning rate and the other hyperparameters carefully to ensure good performance.

So there is no one-size-fits-all optimizer that works best for every problem, and Adam is no exception.
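
Regarding points 2 and 3, a small Keras sketch of what that could look like in practice; `x_train`, `y_train`, `x_val`, `y_val` and `model` are placeholders for your own data and architecture, and the hyperparameter values are only examples, not recommendations.

```python
import tensorflow as tf

# Point 2: standardize the inputs so gradient magnitudes are on a comparable scale
mean = x_train.mean(axis=0)
std = x_train.std(axis=0) + 1e-8
x_train_scaled = (x_train - mean) / std
x_val_scaled = (x_val - mean) / std   # reuse the training statistics

# Point 3: tune Adam's hyperparameters instead of relying blindly on the defaults;
# the learning rate usually has the biggest impact
optimizer = tf.keras.optimizers.Adam(
    learning_rate=3e-4,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-7,
)
model.compile(optimizer=optimizer, loss="mse")
model.fit(x_train_scaled, y_train,
          validation_data=(x_val_scaled, y_val),
          epochs=20)
```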

In addition to 2)
If Adam does not converge well, AMSGrad might be worth a look, see also:
https://johnchenresearch.github.io/demon/
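
Trying AMSGrad is cheap, since it is exposed as a flag on the standard Adam optimizer in the common frameworks:

```python
import tensorflow as tf

# Keras: Adam with the AMSGrad variant enabled
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, amsgrad=True)

# The equivalent flag in PyTorch would be:
# torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)
```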

The linked article also explains some other algorithms, such as QHM (Quasi-Hyperbolic Momentum), which decouples the momentum term from the current gradient when updating the weights; this can also be beneficial!
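
To illustrate the decoupling, here is a rough NumPy sketch of the QHM update as I understand it from the linked article; `nu` blends the raw gradient with the momentum buffer, and the default values shown are just commonly suggested ones, so treat them as an assumption.

```python
import numpy as np

def qhm_step(w, grad, m, lr=0.01, beta=0.999, nu=0.7):
    """Quasi-Hyperbolic Momentum: the update is a weighted mix of the plain
    gradient and the momentum buffer, so it is partly decoupled from the buffer."""
    m = beta * m + (1 - beta) * grad        # ordinary momentum buffer
    update = (1 - nu) * grad + nu * m       # nu=1 recovers Momentum, nu=0 plain SGD
    return w - lr * update, m
```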

Best regards
Christian