I think what you’re really left with is W = W - learning_rate * sign(dW) (ignoring the small epsilon in the denominator), which intuitively makes sense:

Adam combines the ideas behind Momentum and RMSprop. You’re removing the effect of Momentum by setting beta1 to 0, so you’re left with the RMSprop part.

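RMSprop combines the ideas of using only the sign of the gradient and adapting the step size separately for each weight. By setting beta2 to 0, the running average of squared gradients collapses to dW^2 for the current step, so dividing dW by its square root leaves just sign(dW). You end up taking a step of size learning_rate on every weight, in the direction opposite to the sign of its gradient, no matter how large or small that gradient is (this is not equivalent to vanilla gradient descent, where the step scales with the gradient).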
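To check that intuition numerically, here is a minimal sketch in plain Python for a single scalar weight. It assumes the usual Adam update with bias correction and an epsilon of 1e-8; the function names `adam_step` and `sign_step` and the gradient values are made up for illustration.

```python
import math

def adam_step(w, g, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight; returns (w, m, v)."""
    m = beta1 * m + (1 - beta1) * g        # first moment (Momentum part)
    v = beta2 * v + (1 - beta2) * g * g    # second moment (RMSprop part)
    m_hat = m / (1 - beta1 ** t)           # bias correction; equals m when beta1 = 0
    v_hat = v / (1 - beta2 ** t)           # bias correction; equals v when beta2 = 0
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

def sign_step(w, g, lr=0.01):
    """Sign-only update: W = W - learning_rate * sign(dW)."""
    sign = (g > 0) - (g < 0)               # +1, -1, or 0
    return w - lr * sign

# Hypothetical gradients with very different magnitudes.
grads = [3.0, -0.002, 50.0, 7.5]

w_adam, m, v = 1.0, 0.0, 0.0
w_sign = 1.0
for t, g in enumerate(grads, start=1):
    w_adam, m, v = adam_step(w_adam, g, m, v, t, beta1=0.0, beta2=0.0)
    w_sign = sign_step(w_sign, g)

print(w_adam, w_sign)  # the two agree up to the effect of eps
```

Both updates move the weight by learning_rate per step regardless of whether the gradient is 50.0 or -0.002. The only difference comes from epsilon, which matters when a gradient is close to zero.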

Very interesting topic, @rajsura82. I hope my intuition is right. It would be great to hear more opinions.

By the way, you do get standard gradient descent if you’re just applying Momentum and you set beta to 0. Maybe that’s why you were expecting to get that formula.
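A quick sketch of that Momentum case, assuming the exponentially weighted form v = beta * v + (1 - beta) * dW. The toy loss f(w) = w^2 and the function names are only for illustration.

```python
def momentum_step(w, g, v, lr=0.1, beta=0.9):
    """One Momentum update; returns (w, v)."""
    v = beta * v + (1 - beta) * g   # with beta = 0 this is just v = g
    return w - lr * v, v

def gd_step(w, g, lr=0.1):
    """One vanilla gradient descent update."""
    return w - lr * g

def grad(w):
    """Gradient of the toy loss f(w) = w**2."""
    return 2 * w

w_mom, v = 5.0, 0.0
w_gd = 5.0
for _ in range(10):
    w_mom, v = momentum_step(w_mom, grad(w_mom), v, beta=0.0)
    w_gd = gd_step(w_gd, grad(w_gd))

print(w_mom, w_gd)  # same trajectory when beta = 0
```

Unlike the Adam case, the step here still scales with the gradient, so setting beta to 0 recovers plain gradient descent.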

Adam expands on the idea of adding momentum to the optimization process. Its formulation, however, does not suggest that it reduces to simple gradient descent with the betas set to 0. As @nramon mentioned, simple momentum does reduce to simple gradient descent with its beta set to 0.