RMS prop in a favorable setting

houzefa303 · April 17, 2021, 5:19pm

In the RMSprop explanation video, there was a contour plot where the gradient oscillates a lot, and Prof. Andrew Ng said that because db is large and dW is small, we end up correcting the trajectory by dividing each gradient by the square root of the running average of its squares.

But what if the gradient already has a good trajectory? Does RMSprop make it worse?

paulinpaloalto · April 17, 2021, 6:38pm

The operation we are doing here is scaling the two gradients so that they end up at comparable scale. If they start out at comparable scale, then with the additional scaling they should still end up being comparable. So intuitively it shouldn’t hurt, other than spending a bit of computation that you didn’t really need in the “good” case.
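To make the scaling argument concrete, here is a minimal NumPy sketch of the RMSprop update as presented in the lecture. The function names, the learning rate, and the two constant test gradients are illustrative assumptions, not anything from the course materials.

```python
import numpy as np

def rmsprop_step(grad, s, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update. Returns (step, updated running average s)."""
    s = beta * s + (1 - beta) * grad**2       # running average of squared gradients
    step = -lr * grad / (np.sqrt(s) + eps)    # per-component rescaling
    return step, s

def steady_state_step(grad, n_iters=200):
    """Feed the same gradient repeatedly and return the last step taken."""
    s = np.zeros_like(grad)
    step = np.zeros_like(grad)
    for _ in range(n_iters):
        step, s = rmsprop_step(grad, s)
    return step

# Hypothetical gradients, ordered as (dW, db).
lopsided_step = steady_state_step(np.array([0.01, 10.0]))  # very different scales
balanced_step = steady_state_step(np.array([1.0, 1.0]))    # already comparable

print("lopsided gradient ->", lopsided_step)
print("balanced gradient ->", balanced_step)
```

With a constant gradient, the running average settles near the squared gradient, so each component of the step approaches magnitude `lr` in both cases. The lopsided gradient gets equalized, and the balanced one stays balanced.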

houzefa303 · April 18, 2021, 1:13pm

So if I understood correctly, RMSprop scales the components of the gradient so that they end up at a comparable scale, and Momentum averages the gradients over many iterations to reduce oscillations. Is that right?

paulinpaloalto · April 18, 2021, 2:31pm

Yes, that sounds right to me.
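The distinction summarized above can be sketched on a toy gradient sequence. This is only an illustration under assumed values: a constant dW of 0.1 and a db that flips sign every iteration, with beta = 0.9 for both running averages.

```python
import numpy as np

beta = 0.9
eps = 1e-8
v = np.zeros(2)  # Momentum: running average of the gradients
s = np.zeros(2)  # RMSprop: running average of the squared gradients

for t in range(200):
    # Hypothetical gradient (dW, db): small steady dW, large oscillating db.
    g = np.array([0.1, 1.0 if t % 2 == 0 else -1.0])
    v = beta * v + (1 - beta) * g
    s = beta * s + (1 - beta) * g**2
    rms_scaled = g / (np.sqrt(s) + eps)

print("raw gradient      :", g)
print("momentum average  :", v)           # oscillating db mostly cancels out
print("RMSprop-scaled    :", rms_scaled)  # both components end up near magnitude 1
```

Momentum keeps the steady dW component and shrinks the oscillating db component to a small fraction of its raw size, because alternating signs cancel in the average. RMSprop does not cancel anything; it only brings both components to a similar magnitude.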

agparth · April 19, 2021, 11:44am

Hi! Glad that you asked this question, and I hope you are enjoying the course and Discourse. Remember to add the particular course in which you found this as a category, so that it’s easier for other learners to refer to as well.

paulinpaloalto · April 19, 2021, 8:00pm

That’s a good point, Parth. I just moved this thread from “Uncategorized” to the “DLS Course 2” subcategory of the Deep Learning Specialization Category. It’s my first experience trying to move a thread and it was easy, but maybe the UI is not obvious at first glance: what you have to do is click the “Edit Pencil” on the title of the post and that gives you the categorization as one of the things you can modify. We’re all still learning how to use Discourse. It seems really good and the more I learn, the better it gets!

nramon · April 20, 2021, 2:11pm

Hi @houzefa303,

There’s a nice simulator here where you can play with different optimization algorithms. You can choose a point where the direction of steepest descent points to the minimum and compare the trajectories of SGD and RMSprop.

Keep in mind that the objective function seems to be a bit distorted because of the aspect ratio and the learning rate for SGD is twice that of RMSprop.

Have fun!

houzefa303 · April 20, 2021, 2:47pm

Oh, very nice!
Thanks

Caleb · August 22, 2021, 11:36am

But if we needed to move at a faster rate in a particular direction, wouldn’t this hinder progress?

Imagine the scenario in the image above.

Here, more movement is required along the vertical axis than along the horizontal axis.
Here too, RMSprop would reduce movement along the vertical axis, if my intuition is correct.
This would result in more iterations to get to the minimum, right?

Is this the reason the Adam optimization algorithm was developed?

jm1e16 · September 6, 2021, 11:25am

As others have mentioned, this scaling doesn’t always make sense.
This will depend on the starting point.

Starting from point A as in the lecture notes, RMSprop does have the desired effect of compressing the step in the b direction and extending the step in the w direction.

But what about when starting from point B? We do not want to reduce the step in the b direction, and we definitely do not want to introduce a step in the w direction (which is what the given formulas will do). RMSprop actually introduces oscillations in this case.
Please explain.
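A small sketch of the effect described here, assuming point B is a place where dW is almost (but not exactly) zero while db is large. The figure is not reproduced in this thread, so that reading of point B, the gradient values, and the helper name are all assumptions.

```python
import numpy as np

def first_rmsprop_step(grad, lr=0.01, beta=0.9, eps=1e-8):
    """First RMSprop step starting from s = 0 (no bias correction)."""
    s = (1 - beta) * grad**2
    return -lr * grad / (np.sqrt(s) + eps)

# Hypothetical gradient (dW, db): we are almost directly "above" the minimum.
grad_near_axis = np.array([1e-6, 1.0])
step_near_axis = first_rmsprop_step(grad_near_axis)

# If dW is exactly zero, RMSprop takes no step in w at all.
step_on_axis = first_rmsprop_step(np.array([0.0, 1.0]))

def angle_from_b_axis(vec):
    """Angle in degrees between a (w, b) vector and the b axis."""
    return np.degrees(np.arctan2(abs(vec[0]), abs(vec[1])))

print("gradient angle from b axis:", angle_from_b_axis(grad_near_axis))
print("RMSprop step angle        :", angle_from_b_axis(step_near_axis))
print("step when dW == 0         :", step_on_axis)
```

The raw gradient points almost exactly along b, but the RMSprop step has nearly equal w and b components, so it heads off at roughly 45 degrees. When dW is exactly zero, the w component of the step stays zero.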

jonaslalin · September 9, 2021, 1:13pm

The following article on Medium compares different methods and contains visualizations:

TLDR: The pros of RMSprop outweigh the cons, and it works better in the majority of scenarios.

jm1e16 · September 11, 2021, 5:23pm

Thank you for the link, the app is indeed very interesting.
After playing with it for a while, I can see that there is much more going on than discussed in the lectures. It’s a shame that a course like this doesn’t go deeper into the problem.

For example, it seems that the lead RMSprop gets has, in most cases, nothing to do with rescaling the proportions of the individual gradient directions (w and b, if still referring to the lecture).
Instead, the lead is achieved simply because the magnitude of the whole gradient vector is scaled up. In other words, the whole step taken in each iteration is larger.
A comparison of RMSprop to a vanilla Gradient Descent where dW and db are each divided by norm([dW, db]) would make for a much fairer assessment, in my opinion.
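One way to try the comparison proposed above is the sketch below. It runs plain gradient descent, norm-scaled gradient descent, and RMSprop on a toy elongated quadratic. The loss function, curvatures, start point, learning rate, and step count are all assumptions chosen for illustration, and the outcome will depend on them.

```python
import numpy as np

CURV = np.array([0.1, 10.0])  # assumed curvatures along (w, b)

def loss(p):
    return 0.5 * np.sum(CURV * p**2)

def grad(p):
    return CURV * p

def gd(g, state, lr=0.01):
    """Vanilla gradient descent."""
    return -lr * g, state

def normalized_gd(g, state, lr=0.01, eps=1e-8):
    """Gradient descent with the whole gradient divided by norm([dW, db])."""
    return -lr * g / (np.linalg.norm(g) + eps), state

def rmsprop(g, state, lr=0.01, beta=0.9, eps=1e-8):
    """RMSprop: each component divided by its own running RMS."""
    state = beta * state + (1 - beta) * g**2
    return -lr * g / (np.sqrt(state) + eps), state

def run(update, start, n_steps=500):
    """Apply an update rule n_steps times and return the final point."""
    p = np.array(start, dtype=float)
    state = np.zeros_like(p)
    for _ in range(n_steps):
        step, state = update(grad(p), state)
        p = p + step
    return p

start = (-2.0, 1.0)
initial_loss = loss(np.array(start))
optimizers = {"gd": gd, "normalized_gd": normalized_gd, "rmsprop": rmsprop}
final_losses = {name: loss(run(fn, start)) for name, fn in optimizers.items()}

print("initial loss:", initial_loss)
for name, value in final_losses.items():
    print(f"{name:>14}: {value:.6f}")
```

Comparing `normalized_gd` with `rmsprop` isolates the per-component rescaling from the overall change in step length, since both take steps whose size no longer depends on the gradient magnitude.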
