Adam vs RMSProp, Momentum

Here is what I understand about the optimization methods:

Momentum → uses a weighted average of past gradients to control the steps toward the minimum.

RMSProp → if the error is large, RMSProp increases the step length toward the minimum; otherwise, it decreases it.

Adam → combines the advantages of both.

Is that correct?

Hello @Areeg_Fahad,

Here is my version:

  1. Vanilla gradient descent: the gradient gives the direction of the step and also sets the step size, which is proportional to the errors (of predictions).

  2. RMSProp: introduces adaptiveness to vanilla gradient descent. It divides the gradient by the square root of an exponentially weighted average of squared gradients; this denominator suppresses oscillation so that it can converge faster.

  3. Adam: replaces the gradient in RMSProp with its momentum-based version. As you said, momentum uses an (exponentially) weighted average, and that makes the update more reluctant to change rapidly.
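To make the three items (plus Momentum) concrete, here is a minimal NumPy sketch of one update step for each method. The function names, the default hyperparameters, and the bias-correction terms in `adam_step` are my own assumptions for illustration, following the commonly used formulation; they are not taken from any particular library.

```python
import numpy as np

def gd_step(w, grad, lr=0.1):
    # Vanilla gradient descent: step is proportional to the raw gradient.
    return w - lr * grad

def momentum_step(w, grad, v, lr=0.1, beta=0.9):
    # Momentum: step along an exponentially weighted average of gradients.
    v = beta * v + (1 - beta) * grad
    return w - lr * v, v

def rmsprop_step(w, grad, s, lr=0.1, beta2=0.999, eps=1e-8):
    # RMSProp: divide the gradient by the root of an exponentially
    # weighted average of squared gradients (the "denominator").
    s = beta2 * s + (1 - beta2) * grad**2
    return w - lr * grad / (np.sqrt(s) + eps), s

def adam_step(w, grad, v, s, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: RMSProp's denominator, with the gradient in the numerator
    # replaced by its momentum-based version.
    v = beta1 * v + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad**2
    # Bias correction (assumed here); t is the 1-based step count.
    v_hat = v / (1 - beta1**t)
    s_hat = s / (1 - beta2**t)
    return w - lr * v_hat / (np.sqrt(s_hat) + eps), v, s

# Example: one step from the same point with a very uneven gradient.
w = np.array([1.0, 1.0])
grad = np.array([100.0, 0.01])
zeros = np.zeros(2)
print("gd      ", gd_step(w, grad))
print("momentum", momentum_step(w, grad, zeros)[0])
print("rmsprop ", rmsprop_step(w, grad, zeros)[0])
print("adam    ", adam_step(w, grad, zeros, zeros, t=1)[0])
```

In this sketch, vanilla gradient descent moves the first coordinate 10,000 times further than the second, while RMSProp and Adam move both coordinates by about the same amount because each coordinate is divided by its own gradient scale.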

So, I agree with your description of Momentum, but what RMSProp introduces is anti-oscillation; the error (of prediction) is already taken care of by the gradient.

Perhaps your “error” was my “oscillation”, which is why I listed my version at the beginning and clearly differentiated between my “error” and my “oscillation”.
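A small numeric sketch of the difference between "error" and "oscillation", under assumptions of my own: a one-parameter squared-error loss for the first half, and two made-up gradient sequences (one large and sign-flipping, one small and steady) for the second. All names and numbers here are hypothetical.

```python
import math

def squared_error_grad(w, x, y):
    """Gradient of 0.5 * (w*x - y)**2 with respect to w."""
    error = w * x - y
    return error * x

# "Error": in vanilla gradient descent, a larger prediction error already
# gives a proportionally larger step, with no help from RMSProp.
lr, w, x = 0.1, 1.0, 2.0
step_small_error = lr * squared_error_grad(w, x, y=1.0)   # error = 1.0
step_large_error = lr * squared_error_grad(w, x, y=-2.0)  # error = 4.0

def rmsprop_scaled_steps(grads, lr=0.1, beta2=0.9, eps=1e-8):
    """Return the RMSProp step (lr * g / sqrt(s)) for each gradient in turn."""
    s = 0.0
    steps = []
    for g in grads:
        s = beta2 * s + (1 - beta2) * g * g
        steps.append(lr * g / (math.sqrt(s) + eps))
    return steps

# "Oscillation": a direction whose gradient is large and keeps flipping sign,
# next to a direction whose gradient is small but consistent.
steep = [10.0, -10.0, 10.0, -10.0]
flat = [0.1, 0.1, 0.1, 0.1]
steep_steps = rmsprop_scaled_steps(steep)
flat_steps = rmsprop_scaled_steps(flat)

print("GD step ratio (large error / small error):",
      step_large_error / step_small_error)
print("raw gradient ratio (steep / flat):", abs(steep[-1]) / abs(flat[-1]))
print("RMSProp step ratio (steep / flat):",
      abs(steep_steps[-1]) / abs(flat_steps[-1]))
```

The raw gradients differ by a factor of 100 between the two directions, but after dividing by the denominator the RMSProp steps have nearly the same magnitude, so the bouncing direction is damped relative to plain gradient descent while the steady one is not starved.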


Thank you for clarifying

You are welcome @Areeg_Fahad!