The reason for epsilon is to improve numerical stability, e.g. when dividing by numbers very close to zero. Without epsilon we would risk float overflow in the update step (division by a near-zero denominator) and other numerical issues during optimization.
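As a quick illustration, here is a minimal NumPy sketch of an RMSProp-style step (not TensorFlow's actual implementation, and the values are made up):

```python
import numpy as np

g = np.float32(1e-4)    # current gradient
v = np.float32(0.0)     # second-moment estimate, (near) zero early in training
lr = np.float32(0.001)
eps = np.float32(1e-7)

with np.errstate(divide="ignore"):
    step_without_eps = lr * g / np.sqrt(v)   # division by zero -> inf

step_with_eps = lr * g / (np.sqrt(v) + eps)  # bounded, well-behaved step

print(step_without_eps)  # inf
print(step_with_eps)     # 1.0
```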
Regarding your question:
Frankly speaking, I had not noticed it before you pointed it out. But it is interesting. Thanks for highlighting the difference, @Sahil_Singh1!
In the TensorFlow docs, epsilon matches the definition in your Adam screenshot (where epsilon is not under the square root) for both RMSprop and Adam, and both docs refer to the same paper and formula; a small numerical sketch follows the links below:
- tf.keras.optimizers.Adam | TensorFlow v2.14.0
- tf.keras.optimizers.experimental.RMSprop | TensorFlow v2.14.0
- [1412.6980] Adam: A Method for Stochastic Optimization (the formula just before Section 2.1)
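To see why the placement matters numerically, here is a hypothetical comparison (not code from either source; the values are arbitrary):

```python
import numpy as np

v = np.float32(1e-16)   # tiny second-moment estimate
eps = np.float32(1e-7)

denom_outside = np.sqrt(v) + eps  # epsilon outside the sqrt (Adam paper / TF docs)
denom_inside = np.sqrt(v + eps)   # epsilon under the sqrt (your other screenshot)

print(denom_outside)  # ~1.1e-07
print(denom_inside)   # ~3.2e-04, about three orders of magnitude larger
```

So when v is tiny, the two variants can produce effective step sizes that differ by orders of magnitude, which is exactly why consistency matters.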
Side note: my understanding is that adding epsilon only in the denominator (as in the Adam formula) is not always sufficient; see also formula 23 in https://arxiv.org/pdf/1911.05920.pdf, where it is shown via the derivative that the risk of float overflow in the gradient still persists.
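A sketch of why (my own shorthand, not the paper's exact formula 23): differentiating the update with respect to $v$,

$$\frac{\partial}{\partial v}\,\frac{g}{\sqrt{v}+\epsilon}=-\frac{g}{2\sqrt{v}\,\left(\sqrt{v}+\epsilon\right)^{2}},$$

which is unbounded as $v \to 0^{+}$ even with epsilon in the denominator, whereas

$$\frac{\partial}{\partial v}\,\frac{g}{\sqrt{v+\epsilon}}=-\frac{g}{2\,(v+\epsilon)^{3/2}}$$

stays bounded by $g/(2\epsilon^{3/2})$.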
Anyhow, regarding your question: I would say it is rather an implementation detail, but it is important to be aware of it and to stay consistent.
Best regards
Christian