Why not always use Adam optimizer

Hello World,

Considering that Adam uses both the Momentum and RMSProp ideas in its implementation, why not always use the Adam optimizer? In what scenarios would one use Momentum or RMSProp instead of Adam?

Hi there,

here is my take on this matter:

  • Momentum accelerates your search by "using the momentum" of past gradients to carry the update over local minima so that it does not get stuck there
  • RMSProp essentially damps the search in directions that oscillate
  • Adam combines the heuristics of both Momentum and RMSProp (see the sketch after this list), as pointed out in this nice article:
    Source: Intro to optimization in deep learning: Momentum, RMSProp and Adam
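
To make the combination concrete, here is a minimal NumPy sketch of the three update rules. It is purely illustrative and not tied to any framework; the hyperparameter names (`lr`, `beta`, `beta1`, `beta2`, `eps`) and their defaults just follow the usual conventions.

```python
import numpy as np

def momentum_step(w, grad, v, lr=0.01, beta=0.9):
    """Momentum: accumulate an exponentially weighted average of past gradients
    and move along it, which helps the search roll over small local minima."""
    v = beta * v + (1 - beta) * grad
    return w - lr * v, v

def rmsprop_step(w, grad, s, lr=0.01, beta=0.9, eps=1e-8):
    """RMSProp: divide by a running RMS of the gradient, which damps the
    step size in directions that oscillate."""
    s = beta * s + (1 - beta) * grad**2
    return w - lr * grad / (np.sqrt(s) + eps), s

def adam_step(w, grad, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: the Momentum-style first moment combined with the RMSProp-style
    second moment, plus bias correction for the first few steps."""
    v = beta1 * v + (1 - beta1) * grad         # first moment (Momentum part)
    s = beta2 * s + (1 - beta2) * grad**2      # second moment (RMSProp part)
    v_hat = v / (1 - beta1**t)                 # bias correction, t starts at 1
    s_hat = s / (1 - beta2**t)
    return w - lr * v_hat / (np.sqrt(s_hat) + eps), v, s

# toy usage: minimize f(w) = w^2 with Adam
w, v, s = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 201):
    w, v, s = adam_step(w, 2 * w, v, s, t, lr=0.1)
print(w)  # close to 0
```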

So different "cost spaces" call for different numerical approaches to find an acceptable solution as fast as possible. I believe it's fair to say that Adam is a good optimizer to start with, but based on the performance of your optimization you need to check whether it actually fulfils your requirements in terms of your metrics; see also this thread for some discussion on KPIs to track and evaluate: Underfitting and Overfitting - #2 by Christian_Simonis
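
As a rough illustration of "checking your metrics", one could train the same model with several optimizers and compare a validation metric. This is just a sketch: `build_model`, `x_train`, `y_train`, `x_val` and `y_val` are hypothetical stand-ins for your own architecture and data.

```python
import tensorflow as tf

def build_model():
    # placeholder architecture; replace with your own model
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])

candidates = {
    "sgd_momentum": tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    "rmsprop": tf.keras.optimizers.RMSprop(learning_rate=0.001),
    "adam": tf.keras.optimizers.Adam(learning_rate=0.001),
}

results = {}
for name, opt in candidates.items():
    model = build_model()
    model.compile(optimizer=opt, loss="mse", metrics=["mae"])
    history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                        epochs=20, verbose=0)
    results[name] = min(history.history["val_loss"])

print(results)  # pick whichever optimizer meets your requirements/KPIs
```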

In general, I have personally also had good experiences with Adam, as it possesses the favourable characteristics mentioned above.

Side note: saddle points can often be an issue in high-dimensional spaces. If you are interested in more detail, feel free to take a look at this paper from 2014: https://arxiv.org/pdf/1406.2572.pdf

Best regards
Christian

Hi, in addition to @Christian_Simonis' comments I would like to add a few more points about why Adam is not always the best solution.

  1. Adam maintains moving averages of the gradient and its square (the first and second moments), which means it can take longer to converge than other optimizers. This may not be a problem in many cases, but for tasks with a large number of parameters or very small data sets, Adam may be too slow.
  2. Adam is sensitive to the scale of the gradients, so it is important to scale your data before training a model with Adam (see the sketch below). If the inputs are left unscaled, the gradient magnitudes can vary a lot and Adam may have trouble converging.
  3. Adam can also be sensitive to the choice of hyperparameters. It is important to tune the learning rate and the other hyperparameters carefully to ensure good performance.

So there is no one-size-fits-all optimizer that works best for every problem, and Adam is no exception.
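
Regarding points 2 and 3, a small Keras sketch of what that could look like in practice; `x_train`, `y_train`, `x_val`, `y_val` and `model` are placeholders for your own data and architecture, and the hyperparameter values are only examples, not recommendations.

```python
import tensorflow as tf

# Point 2: standardize the inputs so gradient magnitudes are on a comparable scale
mean = x_train.mean(axis=0)
std = x_train.std(axis=0) + 1e-8
x_train_scaled = (x_train - mean) / std
x_val_scaled = (x_val - mean) / std   # reuse the training statistics

# Point 3: tune Adam's hyperparameters instead of relying blindly on the defaults;
# the learning rate usually has the biggest impact
optimizer = tf.keras.optimizers.Adam(
    learning_rate=3e-4,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-7,
)
model.compile(optimizer=optimizer, loss="mse")
model.fit(x_train_scaled, y_train,
          validation_data=(x_val_scaled, y_val),
          epochs=20)
```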

In addition to 2)
If Adam does not converge well, AMSGrad might be worth a look, see also:
https://johnchenresearch.github.io/demon/
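
Trying AMSGrad is cheap, since it is exposed as a flag on the standard Adam optimizer in the common frameworks:

```python
import tensorflow as tf

# Keras: Adam with the AMSGrad variant enabled
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, amsgrad=True)

# The equivalent flag in PyTorch would be:
# torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)
```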

The linked article also explains some other algorithms, such as QHM (Quasi-Hyperbolic Momentum), which decouples the momentum term from the current gradient when updating the weights; this can also be beneficial!
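
To illustrate the decoupling, here is a rough NumPy sketch of the QHM update as I understand it from the linked article; `nu` blends the raw gradient with the momentum buffer, and the default values shown are just commonly suggested ones, so treat them as an assumption.

```python
import numpy as np

def qhm_step(w, grad, m, lr=0.01, beta=0.999, nu=0.7):
    """Quasi-Hyperbolic Momentum: the update is a weighted mix of the plain
    gradient and the momentum buffer, so it is partly decoupled from the buffer."""
    m = beta * m + (1 - beta) * grad        # ordinary momentum buffer
    update = (1 - nu) * grad + nu * m       # nu=1 recovers Momentum, nu=0 plain SGD
    return w - lr * update, m
```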

Best regards
Christian