Adam Optimization


In Adam optimization, if we set beta1 and beta2 to zero, then it works out to:

=> W = W - learning_rate * dW / sqrt(square(dW))
=> W = W - learning_rate

This doesn’t make sense; with the betas set to zero, I thought the formula should work out to:

W = W - learning_rate * dW

Am I missing something?


Hi, @rajsura82.

I think what you’re really left with is W = W - learning_rate x sign(dW), which intuitively makes sense:

  • Adam combines the ideas behind Momentum and RMSprop. You’re removing the effect of Momentum by setting beta1 to 0, so you’re left with the RMSprop part.
  • RMSprop combines the ideas of using only the sign of the gradient and adapting the step size separately for each weight. By setting beta2 to 0 you end up taking steps of size learning_rate in the direction opposite to the gradient (this is not equivalent to vanilla gradient descent).
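A quick NumPy sketch of this (the function name and defaults are illustrative, not from the course code) shows that one Adam step with beta1 = beta2 = 0, epsilon dropped, and no bias correction (which is a no-op here anyway) reduces to a step of size learning_rate in the direction of -sign(dW):

```python
import numpy as np

def adam_update(W, dW, lr, beta1=0.0, beta2=0.0, m=0.0, v=0.0, eps=0.0):
    """One Adam step (bias correction omitted; it is a no-op when both betas are 0)."""
    m = beta1 * m + (1 - beta1) * dW       # first moment; equals dW when beta1 = 0
    v = beta2 * v + (1 - beta2) * dW**2    # second moment; equals dW**2 when beta2 = 0
    return W - lr * m / (np.sqrt(v) + eps)  # with betas = 0: W - lr * dW/|dW| = W - lr * sign(dW)

W = np.array([1.0, -2.0, 3.0])
dW = np.array([0.5, -4.0, 0.1])
print(adam_update(W, dW, lr=0.1))  # [0.9, -1.9, 2.9] — every weight moves by exactly 0.1
```

Note the step size is 0.1 for every weight regardless of the gradient's magnitude, which is exactly the sign(dW) behavior and clearly not vanilla gradient descent.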

Very interesting topic, @rajsura82. I hope my intuition is right. It would be great to hear more opinions :slight_smile:


By the way, you do get standard gradient descent if you’re just applying Momentum and you set beta to 0. Maybe that’s why you were expecting to get that formula.
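A minimal sketch of that case (illustrative names, assuming the course's v = beta * v + (1 - beta) * dW formulation of Momentum):

```python
import numpy as np

def momentum_update(W, dW, lr, beta=0.0, v=0.0):
    """One Momentum step; with beta = 0 the velocity is just dW."""
    v = beta * v + (1 - beta) * dW
    return W - lr * v

W = np.array([1.0, -2.0])
dW = np.array([0.5, -4.0])
print(momentum_update(W, dW, lr=0.1))  # [0.95, -1.6]
print(W - 0.1 * dW)                    # [0.95, -1.6] — identical to vanilla gradient descent
```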

Adam expands on the idea of adding momentum to the optimization process. Its formulation, however, does not reduce to simple gradient descent with both betas set to 0. As @nramon mentioned, plain Momentum does reduce to simple gradient descent when its beta is set to 0.


Thanks. Your intuition makes sense.
