Hello @ajallooe,
I think developing an understanding of gradient descent is crucial, and it also takes time, so this discussion is certainly not the end of it. With that in mind, here are some things I believe are relevant to your journey with gradient descent:
To begin with, if you search (Ctrl + F) the Transcript of the video for the keyword “oscillation”, it appears 7 times, and every occurrence points to a message like this one:
> So that’s RMSprop, and similar to momentum, has the effects of damping out the oscillations in gradient descent, in mini-batch gradient descent
I want to draw your attention there because, even though you mentioned that RMSProp squares things to dampen oscillation, it seemed to me that you thought RMSProp was less capable and considered that a downside. I agree with your description of the sign being discarded by the squaring, so that it cares about magnitude instead of direction, but the denominator term in RMSProp did help!
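To make that concrete, here is a minimal sketch of the RMSProp update. The helper name `rmsprop_step`, the hyperparameter values, and the toy gradients are just my own illustrative choices, not anything from the course:

```python
import numpy as np

def rmsprop_step(w, grad, s, lr=0.01, beta=0.9, eps=1e-8):
    # Running average of the squared gradient: squaring discards the sign,
    # so s only tracks how large recent gradients have been in each dimension.
    s = beta * s + (1 - beta) * grad**2
    # Dividing by sqrt(s) rescales each dimension: dimensions with large
    # (e.g. oscillating) gradients get their step shrunk relative to plain
    # gradient descent with the same learning rate.
    w = w - lr * grad / (np.sqrt(s) + eps)
    return w, s

# Toy example: dimension 0 sees an oscillating gradient (+5, -5, +5, ...),
# dimension 1 sees a steady small gradient of 0.1.
w, s = np.array([1.0, 1.0]), np.zeros(2)
for t in range(10):
    g = np.array([5.0 * (-1) ** t, 0.1])
    w, s = rmsprop_step(w, g, s)
# The back-and-forth along dimension 0 shrinks from +/-0.05 (plain GD with
# the same lr) to roughly +/-0.01, while dimension 1 keeps making progress.
print(w)
```

Because the bouncing is damped this way, you can afford a larger learning rate than plain gradient descent would tolerate, which is the point the lecture is making.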
Momentum, however, can do something that the denominator (deviation) term cannot, which is to shoot over those bumpy saddle points. That is what keeping to one direction buys you. So there is a division of labour, and I think we can see that each of them has a role to play here. Now, did you consider how they might each act to help in the following cases (there is a small sketch after the list):
- gradient term is larger than the deviation term
- deviation term is larger than the gradient term
- both terms have a similar scale
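Here is the small sketch I promised for those three cases. The numbers are made up purely to show the scale of the resulting step:

```python
import numpy as np

def effective_step(grad, rms, lr=0.01, eps=1e-8):
    # The size of the update is lr * grad / (rms + eps), so what matters is
    # the ratio between the current gradient and the deviation (RMS) term.
    return lr * grad / (rms + eps)

# Case 1: gradient term larger than the deviation term -> step is amplified
print(effective_step(grad=1.0, rms=0.1))   # ~0.1, ten times lr

# Case 2: deviation term larger than the gradient term -> step is damped
print(effective_step(grad=0.1, rms=1.0))   # ~0.001, a tenth of lr

# Case 3: both terms of a similar scale -> step is roughly lr itself
print(effective_step(grad=1.0, rms=1.0))   # ~0.01, about lr
```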
As for whether the denominator term can slow things down: I would not be surprised if it slows things down a bit in the early stage of training, BUT I would be more impressed by the speedier overall convergence that the term brings. For this, I think we need experiments.
Is it a wrong thing, as your title said, to lag a little behind at first while improving the overall training?
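If you would like to try that experiment, here is a minimal sketch on an ill-conditioned quadratic bowl. The loss function, learning rate, betas, and step count are my own toy choices, not anything from the course; printing the loss at an early step and at the end lets you judge whether the denominator's early drag pays off later:

```python
import numpy as np

CURV = np.array([50.0, 1.0])   # curvatures of an elongated quadratic bowl

def loss(w):
    # L(w) = 0.5 * (50*w0^2 + w1^2): steep along w0, shallow along w1
    return 0.5 * np.sum(CURV * w**2)

def grad(w):
    return CURV * w

def run(optimizer, steps=300, lr=0.01):
    w = np.array([1.0, 1.0])
    m, v = np.zeros(2), np.zeros(2)
    history = []
    for t in range(1, steps + 1):
        g = grad(w)
        if optimizer == "gd":
            w = w - lr * g
        elif optimizer == "momentum":
            m = 0.9 * m + (1 - 0.9) * g          # EWA of gradients
            w = w - lr * m
        elif optimizer == "adam":
            m = 0.9 * m + (1 - 0.9) * g          # momentum term
            v = 0.999 * v + (1 - 0.999) * g**2   # deviation (RMS) term
            m_hat = m / (1 - 0.9**t)             # bias correction
            v_hat = v / (1 - 0.999**t)
            w = w - lr * m_hat / (np.sqrt(v_hat) + 1e-8)
        history.append(loss(w))
    return history

for name in ("gd", "momentum", "adam"):
    h = run(name)
    print(f"{name:9s} loss at step 10: {h[9]:8.4f}   final loss: {h[-1]:.6f}")
```

On this particular toy bowl, the adaptive denominator tends to lose the first few steps but finish with a much lower loss; of course a two-dimensional quadratic is not a neural network, so treat it only as a starting point for your own experiments.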
Lastly, I have written this, which looks at Adam from the angle of the signal-to-noise ratio; that is not my original idea, since the Adam paper itself describes it that way. I hope that will be a useful angle for you.
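In case it helps to have the formula in front of you, the Adam update (Algorithm 1 in the Adam paper, using the paper's symbols) is

$$\theta_t = \theta_{t-1} - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$

and the paper reads the ratio $\hat{m}_t / \sqrt{\hat{v}_t}$ as a signal-to-noise ratio: when recent gradients keep changing sign, the mean $\hat{m}_t$ is small while the RMS $\sqrt{\hat{v}_t}$ stays large, the SNR drops, and the effective step size automatically shrinks.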
Cheers,
Raymond