Adam algorithm explanation

Good morning, I’m writing in reference to the Adam algorithm explanation. According to week 2, Adam is a combination of Gradient Descent with momentum (also simply called momentum) and the RMSProp algorithm. However, the original conference paper states the following: “Our method is designed to combine the advantages of two recently popular methods: AdaGrad (Duchi et al., 2011), which works well with sparse gradients, and RMSProp”. There is no mention of Gradient Descent with momentum, which is the method referred to in the course. Are there any resources that explain this sticking point? Best regards, Przemyslaw.

Hi, @przemyslaw.

Mathematically the momentum is there, of course, and the paper does in fact explicitly relate Adam to RMSProp with momentum on page 5.
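
For reference, here is the update from Algorithm 1 of the paper (paraphrasing the notation slightly), with g_t the gradient at step t; seeing the two accumulators side by side makes the connection to both methods easy to spot:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,g_t &&\text{(first moment: momentum-style average of gradients)}\\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2 &&\text{(second moment: RMSProp-style average of squared gradients)}\\
\hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t} &&\text{(bias correction)}\\
\theta_t &= \theta_{t-1} - \alpha\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}
\end{aligned}
$$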

My guess is that AdaGrad was not mentioned because it is not covered in the course, and because the statement that "AdaGrad corresponds to a version of Adam with \beta_1 = 0, infinitesimal (1 − \beta_2), and a replacement of \alpha by an annealed version \alpha_t = \alpha \cdot t^{-\frac{1}{2}}" is not very intuitive, whereas relating \beta_1 to Gradient Descent with momentum and \beta_2 to RMSProp is easy to understand :slight_smile:
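
If it helps to see the correspondence as code, here is a minimal NumPy sketch of a single Adam step plus a toy usage example (my own illustration, not the course's or the paper's reference implementation; the function name and the toy objective are made up for the example):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are carried between calls, t starts at 1."""
    m = beta1 * m + (1 - beta1) * grad       # moving average of gradients
                                             # -> the Gradient Descent with momentum part (beta1)
    v = beta2 * v + (1 - beta2) * grad**2    # moving average of squared gradients
                                             # -> the RMSProp part (beta2)
    m_hat = m / (1 - beta1**t)               # bias correction for the zero initialisation
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimise f(theta) = theta_1^2 + theta_2^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, alpha=0.01)
print(theta)  # both coordinates end up near 0

# With beta1 = 0 the momentum average keeps only the current gradient, leaving an
# RMSProp-style update with bias correction; the paper's AdaGrad correspondence
# additionally takes (1 - beta2) to be infinitesimal and anneals alpha.
```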
