Adam algorithm explanation

Good morning, I’m writing in reference to the Adam algorithm explanation. According to week 2, Adam is a combination of Gradient Descent with momentum (also simply called momentum) and the RMSProp algorithm. However, the original conference paper states the following: “Our method is designed to combine the advantages of two recently popular methods: AdaGrad (Duchi et al., 2011), which works well with sparse gradients, and RMSProp”. There is no mention of Gradient Descent with momentum, which is the method referred to in the course. Are there any resources that explain this sticking point? Best regards, Przemyslaw.

Hi, @przemyslaw.

Mathematically the momentum is there, of course, and the paper does in fact explicitly relate Adam to RMSProp with momentum on page 5.
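
For reference, here is the update from Algorithm 1 of the paper (paraphrasing the notation slightly), with g_t the gradient at step t; seeing the two accumulators side by side makes the connection to both methods easy to spot:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,g_t &&\text{(first moment: momentum-style average of gradients)}\\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2 &&\text{(second moment: RMSProp-style average of squared gradients)}\\
\hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t} &&\text{(bias correction)}\\
\theta_t &= \theta_{t-1} - \alpha\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}
\end{aligned}
$$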

My guess is that AdaGrad was not mentioned because it is not covered in the course, and because the statement that "AdaGrad corresponds to a version of Adam with \beta_1 = 0, infinitesimal (1 − \beta_2), and a replacement of \alpha by an annealed version \alpha_t = \alpha \cdot t^{-\frac{1}{2}}" is not very intuitive, whereas relating \beta_1 to Gradient Descent with momentum and \beta_2 to RMSProp is easy to understand :slight_smile:
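
If it helps to see the correspondence as code, here is a minimal NumPy sketch of a single Adam step plus a toy usage example (my own illustration, not the course's or the paper's reference implementation; the function name and the toy objective are made up for the example):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are carried between calls, t starts at 1."""
    m = beta1 * m + (1 - beta1) * grad       # moving average of gradients
                                             # -> the Gradient Descent with momentum part (beta1)
    v = beta2 * v + (1 - beta2) * grad**2    # moving average of squared gradients
                                             # -> the RMSProp part (beta2)
    m_hat = m / (1 - beta1**t)               # bias correction for the zero initialisation
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimise f(theta) = theta_1^2 + theta_2^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, alpha=0.01)
print(theta)  # both coordinates end up near 0

# With beta1 = 0 the momentum average keeps only the current gradient, leaving an
# RMSProp-style update with bias correction; the paper's AdaGrad correspondence
# additionally takes (1 - beta2) to be infinitesimal and anneals alpha.
```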
