As we know, the formula for the parameter update in Adam is:

W = W - learning_rate * VdW / (√SdW + ε)
b = b - learning_rate * Vdb / (√Sdb + ε)

I tried changing the SdW and Sdb formulas to take the square of the velocities instead of the raw gradients, and then performed the updates as usual.

Modified Formulas:
SdW = β * SdW + (1-β) * VdW^2
Sdb = β * Sdb + (1-β) * Vdb^2

My reasoning was that computing the second moments (as in RMSProp) on the velocities might work better, since the velocities are already smoothed and point in less wiggly directions than the raw gradients, whose squares oscillate heavily. I noticed this performs well with a low learning rate (e.g. 1e-2 or less), and converges a bit faster than the original Adam in terms of cost as well. (I also tried learning rate decay.) But with a large learning rate, the cost starts low and then gradually increases (a hugely wiggly curve).
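To make the proposed change concrete, here is a minimal NumPy sketch of one update step. The function name, defaults, and the toy quadratic at the end are my own illustrative choices, not from the thread; the only deliberate difference from standard Adam is that the second moment is accumulated on the velocity VdW rather than on the raw gradient dW:

```python
import numpy as np

def modified_adam_step(W, dW, VdW, SdW, t, lr=1e-2,
                       beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment (velocity), exactly as in standard Adam
    VdW = beta1 * VdW + (1 - beta1) * dW
    # Second moment computed on the *velocity* (the proposed change),
    # instead of on the raw gradient dW as in standard Adam
    SdW = beta2 * SdW + (1 - beta2) * VdW ** 2
    # Bias correction and parameter update, as in standard Adam
    VdW_hat = VdW / (1 - beta1 ** t)
    SdW_hat = SdW / (1 - beta2 ** t)
    W = W - lr * VdW_hat / (np.sqrt(SdW_hat) + eps)
    return W, VdW, SdW

# Toy check: minimize f(w) = w^2, whose gradient is 2w
w, v, s = 5.0, 0.0, 0.0
for t in range(1, 201):
    w, v, s = modified_adam_step(w, 2 * w, v, s, t)
```

With a small learning rate the iterate moves steadily toward the minimum, which matches the behavior described above.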

I want to know: is this over-smoothing the update direction, or did I make a mistake in deriving the formulas?

Logically, when you square the velocities, the magnitudes of the dW and db updates can change much more sharply, so you get more oscillations at higher learning rates and hence less stable convergence.

Hi @gent.spah, thanks for the reply.

Would you please show me an example using the above equations?

No, I don't have one.

What happens to the magnitude when you square a number depends critically on whether its absolute value is < 1 or > 1, right? So as long as you keep things small, you're smoothing out the path, but once a value tips over 1, you are more likely to get divergence, as Gent points out. So your squaring strategy may work better in some cases, but not always.

You could graph the results of your experiments.
