The reason for epsilon is to improve numerical stability, e.g. when dividing by numbers very close to zero. Without epsilon we would risk float overflow in the update step (division by a near-zero denominator) and other numerical issues during optimization.
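As a quick illustration, here is a minimal NumPy sketch of an RMSProp-style step (not TensorFlow's actual implementation, and the values are made up):

```python
import numpy as np

g = np.float32(1e-4)    # current gradient
v = np.float32(0.0)     # second-moment estimate, (near) zero early in training
lr = np.float32(0.001)
eps = np.float32(1e-7)

with np.errstate(divide="ignore"):
    step_without_eps = lr * g / np.sqrt(v)   # division by zero -> inf

step_with_eps = lr * g / (np.sqrt(v) + eps)  # bounded, well-behaved step

print(step_without_eps)  # inf
print(step_with_eps)     # 1.0
```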
Regarding your question:
Frankly speaking, I had not noticed it before you pointed it out. But it is interesting. Thanks for highlighting the difference, @Sahil_Singh1!
In the TensorFlow docs, epsilon matches the definition in your Adam screenshot (where epsilon is not under the square root) for both RMSprop and Adam, and both docs refer to the same paper and formula; a small numerical sketch follows the links below:
- tf.keras.optimizers.Adam | TensorFlow v2.14.0
- tf.keras.optimizers.experimental.RMSprop | TensorFlow v2.14.0
- [1412.6980] Adam: A Method for Stochastic Optimization (the formula just before Section 2.1)
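To see why the placement matters numerically, here is a hypothetical comparison (not code from either source; the values are arbitrary):

```python
import numpy as np

v = np.float32(1e-16)   # tiny second-moment estimate
eps = np.float32(1e-7)

denom_outside = np.sqrt(v) + eps  # epsilon outside the sqrt (Adam paper / TF docs)
denom_inside = np.sqrt(v + eps)   # epsilon under the sqrt (your other screenshot)

print(denom_outside)  # ~1.1e-07
print(denom_inside)   # ~3.2e-04, about three orders of magnitude larger
```

So when v is tiny, the two variants can produce effective step sizes that differ by orders of magnitude, which is exactly why consistency matters.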
Side note: my understanding is that adding epsilon only in the denominator (as in the Adam formula) is not always sufficient; see also formula 23 in https://arxiv.org/pdf/1911.05920.pdf, where it is shown via the derivative that the risk of float overflow in the gradient still persists.
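A sketch of why (my own shorthand, not the paper's exact formula 23): differentiating the update with respect to $v$,

$$\frac{\partial}{\partial v}\,\frac{g}{\sqrt{v}+\epsilon}=-\frac{g}{2\sqrt{v}\,\left(\sqrt{v}+\epsilon\right)^{2}},$$

which is unbounded as $v \to 0^{+}$ even with epsilon in the denominator, whereas

$$\frac{\partial}{\partial v}\,\frac{g}{\sqrt{v+\epsilon}}=-\frac{g}{2\,(v+\epsilon)^{3/2}}$$

stays bounded by $g/(2\epsilon^{3/2})$.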
Anyhow, regarding your question: I would say it is rather an implementation detail, but it is important to be aware of it and to stay consistent.
Best regards
Christian