C2W2: RMSprop has the epsilon term inside the square root, while Adam has it outside. Why this difference?

From notes:


Is this a minor implementation detail that can be ignored, or does it reflect something significant?

I saw similar questions, but did not find a satisfactory answer in them:

question on the equation for Adam

Differential addition of epsilon in Batch Norm and RMSProp


Hi @Sahil_Singh1

The reason for epsilon is to improve numerical stability, e.g. when dividing by very small numbers close to zero. If we did not have epsilon there, we would risk float overflow of the gradient update and numerical issues during optimization.
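A minimal NumPy sketch of that failure mode (the values here are hypothetical, chosen only to make the blow-up visible): when the second-moment estimate is essentially zero, the division explodes unless epsilon bounds the denominator.

```python
import numpy as np

# Hypothetical scenario: a parameter whose squared-gradient average s
# has decayed to (almost) zero, while a small gradient still arrives.
grad = np.array([1e-4])
s = np.array([1e-30])   # exponentially averaged squared gradients
eps = 1e-8

# Without epsilon the update step explodes:
unsafe = grad / np.sqrt(s)          # on the order of 1e11
# With epsilon the step stays bounded by roughly grad / eps:
safe = grad / (np.sqrt(s) + eps)    # on the order of 1e4

assert np.isfinite(safe).all()
assert safe[0] < unsafe[0]
```

So epsilon acts as a floor on the denominator; it is not a tuning knob for the optimizer so much as a guard against division by (near) zero.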

Regarding your question:

Frankly, I had not noticed this before you pointed it out, but it is interesting. Thanks for highlighting the difference, @Sahil_Singh1!

In the TensorFlow docs, the epsilon placement matches the Adam definition in your screenshot (epsilon outside the square root) for both RMSProp and Adam, and both implementations refer to the same paper and formula. See also:

Side note: my understanding is that adding epsilon only in the denominator (as in the Adam term) is not sufficient; see formula 23 in https://arxiv.org/pdf/1911.05920.pdf, where it is shown via the derivative that the risk of float overflow still persists.

Anyhow, regarding your question: I would say it is rather an implementation detail, but it is important to be aware of it and to stay consistent within an implementation.
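To see how much the two placements actually differ, here is a small comparison sketch (the values of `s` and `eps` are illustrative, not from any course code): for second-moment estimates much larger than epsilon the two denominators are nearly identical, and they only diverge as `s` approaches zero.

```python
import numpy as np

eps = 1e-8
# Second-moment estimates ranging from "typical" to vanishingly small:
s = np.array([1e-2, 1e-8, 1e-16])

inside  = np.sqrt(s + eps)   # RMSprop-style: epsilon under the square root
outside = np.sqrt(s) + eps   # Adam-style: epsilon outside the square root

# For s = 1e-2 both denominators are ~0.1 and the choice is irrelevant.
# For s -> 0, inside -> sqrt(eps) = 1e-4 while outside -> eps = 1e-8,
# so the resulting update steps can differ by orders of magnitude.
print(inside)
print(outside)
```

This is why it is fair to call it an implementation detail in the common regime, while still being something one should apply consistently, since the behavior near zero is genuinely different.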

Best regards
