Hi @Ibrahim_Mustafa,
in addition to @Elemento's great reply, check out this exemplary viz:
RMSprop applies an adaptation based on squared gradients; see also the Keras Doc (and the small sketch after the list below):
- Maintain a moving (discounted) average of the square of gradients
- Divide the gradient by the root of this average
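To make those two bullet points concrete, here is a minimal NumPy sketch of a single RMSprop step. It is only an illustration of the idea, not the Keras internals; the function name `rmsprop_update` and the hyperparameter defaults are just assumptions for the example:

```python
import numpy as np

# Minimal RMSprop step (illustrative sketch, not the Keras implementation).
def rmsprop_update(w, grad, avg_sq, lr=0.001, rho=0.9, eps=1e-7):
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2   # moving (discounted) average of squared gradients
    w = w - lr * grad / (np.sqrt(avg_sq) + eps)       # divide the gradient by the root of this average
    return w, avg_sq

# Toy usage: one step on a small weight vector
w, avg_sq = np.array([1.0, -2.0]), np.zeros(2)
grad = np.array([0.5, -0.1])
w, avg_sq = rmsprop_update(w, grad, avg_sq)
```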
Momentum can also be implemented using a moving average, but over the past gradients directly; see also this source.
Momentum can be imagined as „memorising“ the inertia, so the search does not get stuck in local minima but hopefully, in theory, makes it to the global optimum. (In practice you often do not need to reach the global optimum as long as the model performance is robust and sufficient…)
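Again just a minimal sketch of that idea, with illustrative names (`momentum_update`, `velocity`, `beta`) and the exponentially-weighted-average formulation:

```python
import numpy as np

# Minimal momentum step: a moving average over the past gradients themselves.
def momentum_update(w, grad, velocity, lr=0.01, beta=0.9):
    velocity = beta * velocity + (1.0 - beta) * grad  # "inertia": averaged gradient direction
    w = w - lr * velocity                             # step along the averaged direction
    return w, velocity
```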
If you want to see how RMSprop and momentum can be combined, check out this thread: Adam Optimization Question - #2 by Christian_Simonis
In summary, with respect to the usual goals:
- momentum accelerates your search in the direction of the global minimum by „using the inertia“ to carry you over local minima
- RMSProp dampens the search in the direction of oscillations, since squaring the gradients penalizes large outlier components more strongly
- ADAM combines the heuristics of both Momentum and RMSProp (see the sketch below), as pointed out in this nice article:
Source: Intro to optimization in deep learning: Momentum, RMSProp and Adam, see also this thread: Why not always use Adam optimizer - #2 by Christian_Simonis
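Putting the two sketches above together, this is roughly how an ADAM step looks (again a simplified illustration with the commonly used default hyperparameters, not a copy of any library implementation):

```python
import numpy as np

# Minimal Adam step: momentum-style first moment + RMSprop-style second moment.
def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * grad          # momentum part: average of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # RMSprop part: average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # bias correction for early steps (t starts at 1)
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # RMSprop-style scaling of the momentum direction
    return w, m, v
```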
Hope that helps!
Best regards
Christian