As we know, the parameter update formulas in Adam are:
→ W = W - learning_rate * VdW / (√SdW + ε)
→ b = b - learning_rate * Vdb / (√Sdb + ε)
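For reference, here is a minimal NumPy sketch of one standard Adam step matching the formulas above (the function name, the hyperparameter defaults, and the omission of bias correction are my own assumptions, just to make the comparison concrete):

```python
import numpy as np

def adam_step(W, dW, VdW, SdW, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update (bias correction omitted for brevity)."""
    VdW = beta1 * VdW + (1 - beta1) * dW        # first moment: velocity
    SdW = beta2 * SdW + (1 - beta2) * dW ** 2   # second moment on raw gradients
    W = W - lr * VdW / (np.sqrt(SdW) + eps)     # parameter update
    return W, VdW, SdW
```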
I tried changing the SdW and Sdb formulas to take the square of the velocities instead of the square of the raw gradients, and then performed the updates as usual.
Modified Formulas:
→ SdW = β * SdW + (1-β) * VdW^2
→ Sdb = β * Sdb + (1-β) * Vdb^2
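Here is the same sketch with my modification applied, i.e. the second moment is computed on the velocity rather than on the raw gradient (again just an illustrative sketch; only the SdW line changes):

```python
def modified_adam_step(W, dW, VdW, SdW, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam step where the second moment tracks the squared velocity."""
    VdW = beta1 * VdW + (1 - beta1) * dW         # first moment, unchanged
    SdW = beta2 * SdW + (1 - beta2) * VdW ** 2   # second moment on the velocity
    W = W - lr * VdW / (np.sqrt(SdW) + eps)      # update rule, unchanged
    return W, VdW, SdW
```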
I thought that computing the second moments (RMSProp-style) on the velocities might work better, since the velocities are already smoothed first moments and point in less wiggly directions than the raw gradients, which have huge oscillations. I noticed that this performs well with a low learning rate (e.g., 1e-2 or less) and is a bit faster than original Adam in terms of cost as well (I also tried learning-rate decay). But specifying a large learning rate makes the cost start low and then gradually increase (a hugely wiggly curve).
I want to know: is this over-smoothing the update direction, or did I make a mistake in the formulas?