C2W2: RMSprop has the epsilon term inside the square root, while Adam has it outside. Why the difference?

From the notes:

RMSProp (epsilon inside the square root):
W := W - α · dW / √(S_dW + ε)

Adam (epsilon outside the square root):
W := W - α · V_dW^corrected / (√(S_dW^corrected) + ε)
Is this a minor implementation detail that can be ignored, or does it reflect something significant?
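Written out, the two updates I mean look roughly like this (a minimal NumPy sketch; the variable names and hyperparameter values are my own illustrative choices, not the exact course code):

```python
import numpy as np

def rmsprop_update(W, dW, S_dW, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSProp step with epsilon inside the square root (as in the notes)."""
    S_dW = beta * S_dW + (1 - beta) * dW ** 2
    W = W - lr * dW / np.sqrt(S_dW + eps)            # eps under the sqrt
    return W, S_dW

def adam_update(W, dW, V_dW, S_dW, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with epsilon outside the square root (as in the Adam paper)."""
    V_dW = beta1 * V_dW + (1 - beta1) * dW           # first moment
    S_dW = beta2 * S_dW + (1 - beta2) * dW ** 2      # second moment
    V_hat = V_dW / (1 - beta1 ** t)                  # bias correction
    S_hat = S_dW / (1 - beta2 ** t)
    W = W - lr * V_hat / (np.sqrt(S_hat) + eps)      # eps outside the sqrt
    return W, V_dW, S_dW
```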

I saw similar questions, but didn't find a satisfactory answer:

- question on the equation for Adam
- Differential addition of epsilon in Batch Norm and RMSProp


Hi @Sahil_Singh1

The reason for epsilon is to improve numerical stability, e.g. when dividing by very small numbers close to zero. If we did not have epsilon here, we would face the risk of float overflow and other numerical issues during optimization.
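A quick toy example of what can go wrong when the accumulated squared gradient is (close to) zero; the numbers are made up purely for illustration:

```python
import numpy as np

S_dW = np.float32(0.0)    # accumulated squared gradient, here exactly zero
dW   = np.float32(0.01)   # current gradient
eps  = 1e-8

print(dW / np.sqrt(S_dW))          # inf: division by zero breaks the update
print(dW / (np.sqrt(S_dW) + eps))  # ~1e6: large but finite, roughly capped at dW / eps
```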

Regarding your question:

I had not noticed that before you pointed it out, frankly speaking. But it is interesting. Thanks for highlighting the difference, @Sahil_Singh1!

In the TensorFlow docs we can find that epsilon matches your definition of Adam above (where epsilon is not under the square root) for both RMSprop and Adam, and both refer to the same paper and the same formula; see the documentation for tf.keras.optimizers.RMSprop and tf.keras.optimizers.Adam.
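For reference, a small sketch of how epsilon is passed to both Keras optimizers; the 1e-7 values below are, as far as I know, the current TensorFlow defaults (the Adam paper itself uses 1e-8):

```python
import tensorflow as tf

# Both optimizers expose epsilon as a constructor argument.
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9, epsilon=1e-7)
adam = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7)
```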

Side note: my understanding is that only adding epsilon in the denominator (similar to the Adam term) is not sufficient; see also formula 23 in https://arxiv.org/pdf/1911.05920.pdf, where it is shown from the derivative that the risk of gradient float overflow still persists.

Anyhow, regarding your question: I would say it is rather an implementation detail, but it is important to be aware of it and to stay consistent.

Best regards
Christian
