In the RMSprop explanation video, there was a contour plot where the gradient oscillates a lot, and Prof. Andrew Ng said that because db is small and dW is large, we end up correcting the trajectory of the gradient by dividing by the square roots.
But what if the gradient already has a good trajectory? Does RMSprop make it worse?
The operation we are doing here is scaling the two gradients so that they end up at comparable scale. If they start out at comparable scale, then with the additional scaling they should still end up being comparable. So intuitively it shouldn’t hurt other than the fact that you’ve wasted a bit of computation that you didn’t really need to in the “good” case.
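To make that scaling concrete, here is a minimal sketch of the RMSprop update in the lecture's dW/db notation (the hyperparameter values beta, eps, and lr are assumptions, not from the lecture):

```python
import numpy as np

beta, eps, lr = 0.9, 1e-8, 0.01  # assumed hyperparameter values

def rmsprop_step(W, b, dW, db, s_dW, s_db):
    # Keep an exponentially weighted average of the squared gradients.
    s_dW = beta * s_dW + (1 - beta) * dW**2
    s_db = beta * s_db + (1 - beta) * db**2
    # Divide each gradient by the square root of its running average,
    # which brings both components to a comparable scale.
    W = W - lr * dW / (np.sqrt(s_dW) + eps)
    b = b - lr * db / (np.sqrt(s_db) + eps)
    return W, b, s_dW, s_db
```

If you call this with dW = 10 and db = 0.1 from a cold start, both parameters move by almost exactly the same amount, which illustrates the "comparable scale" point: a large gradient and a small one produce similar-sized steps after the division.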
So if I understood correctly, RMSprop scales the components of the gradient so that they end up at a comparable scale, and Momentum averages the gradients over many iterations to reduce oscillations. Is that right?
Yes, that sounds right to me.
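As a quick sketch of the momentum half of that summary (beta and lr are assumed values, following the lecture's exponentially weighted average formulation):

```python
beta, lr = 0.9, 0.01  # assumed hyperparameter values

def momentum_step(w, dw, v):
    # Exponentially weighted average of past gradients: components that
    # oscillate in sign largely cancel out, while components that point
    # consistently in one direction accumulate toward the raw gradient.
    v = beta * v + (1 - beta) * dw
    return w - lr * v, v
```

Feeding it alternating gradients of +1 and -1 keeps the velocity near zero (the oscillation is damped), while a constant gradient of +1 drives the velocity toward 1, matching the "averaging to reduce oscillations" intuition.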
Hi! Glad you asked this question, and I hope you are enjoying the course and Discourse. Remember to add the particular course in which you found this as a category, so that it's easier for other learners to refer to as well.
That’s a good point, Parth. I just moved this thread from “Uncategorized” to the “DLS Course 2” subcategory of the Deep Learning Specialization Category. It’s my first experience trying to move a thread and it was easy, but maybe the UI is not obvious at first glance: what you have to do is click the “Edit Pencil” on the title of the post and that gives you the categorization as one of the things you can modify. We’re all still learning how to use Discourse. It seems really good and the more I learn, the better it gets!
There’s a nice simulator here where you can play with different optimization algorithms. You can choose a point where the direction of steepest descent points to the minimum and compare the trajectories of SGD and RMSprop.
Keep in mind that the objective function seems to be a bit distorted because of the aspect ratio and the learning rate for SGD is twice that of RMSprop.
But if it were required to move at a faster rate in a particular direction, wouldn't this hinder progress?
Imagine the scenario as in the image above:
- Here, moving along the vertical axis is required more than moving along the horizontal axis.
- Here too, RMSprop would reduce movement along the vertical axis, if my intuition is correct.
- This will result in more iterations to get to the minimum, right?
Is this the reason that the Adam optimization algorithm was developed?
As others have mentioned, this scaling doesn’t always make sense.
This will depend on the starting point.
Starting from point A, as in the lecture notes, RMSprop does have the desired effect of compressing the b gradient and extending the step in the w direction.
But what about when starting from point B? We do not want to reduce the step in the b direction, and we definitely do not want to introduce a step in the w direction (which is what happens with the given formulas). RMSprop actually introduces oscillations in this case.
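A small numeric illustration of that point (the gradient values here are made up for the example): at a "point B", the gradient is almost entirely along b, yet after RMSprop scaling the tiny w component is inflated to the same size as the b component.

```python
import numpy as np

beta, eps = 0.9, 1e-8       # assumed hyperparameter values
dW, db = 0.01, 5.0          # almost no w-gradient, large b-gradient
s_dW = (1 - beta) * dW**2   # first-iteration running averages (start at 0)
s_db = (1 - beta) * db**2
step_W = dW / (np.sqrt(s_dW) + eps)
step_b = db / (np.sqrt(s_db) + eps)
# Both scaled steps come out to about 1/sqrt(1 - beta) ~ 3.16, so the
# update now moves as much in w as in b, even though almost no movement
# in w was wanted.
```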
The following article on Medium compares different methods and contains visualizations:
TL;DR: the pros of RMSprop outweigh the cons, and it works better in the majority of scenarios.
Thank you for the link, the app is indeed very interesting.
After playing with it for a while, I can see that there is much more going on than discussed in the lectures. It’s a shame that a course like this doesn’t go deeper into the problem.
For example, it seems that the major lead RMSprop gets has, in most cases, nothing to do with rescaling the proportions of the individual gradient directions (w and b, if still referring to the lecture).
Instead, the lead is achieved simply because the overall magnitude of the gradient vector is scaled up. In other words, the whole step taken in each iteration is larger.
A comparison of RMSprop to vanilla gradient descent where dW and db are each scaled by norm([dW, db]) would make for a much fairer assessment, in my opinion.
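The proposed baseline could be sketched like this (the function name and learning rate are made up for illustration): vanilla gradient descent where the gradient vector is rescaled to unit norm, so any remaining advantage of RMSprop would come from reshaping the direction of the step, not from simply taking a bigger one.

```python
import numpy as np

def normalized_gd_step(params, grads, lr=0.01):
    # Scale the whole gradient vector by its norm, as suggested above,
    # so every update has the same magnitude (lr) regardless of how
    # large or small the raw gradient is.
    g = np.asarray(grads, dtype=float)
    norm = np.linalg.norm(g)  # norm([dW, db])
    return np.asarray(params, dtype=float) - lr * g / (norm + 1e-8)
```

With this baseline, the step length is fixed at lr, so trajectory differences against RMSprop would reflect only the per-component rescaling.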