In the RMSprop explanation video, there was this contour plot where the gradient oscillates a lot, and Prof. Andrew Ng said that because db is large and dW is small, we end up correcting the trajectory of the gradient by dividing each component by the square root of its accumulated squared gradient.

But what if the gradient already has a good trajectory? Does RMSprop make it worse?

The operation we are doing here is scaling the two gradient components so that they end up at a comparable scale. If they start out at a comparable scale, then with the additional scaling they should still end up being comparable. So intuitively it shouldn't hurt, other than the fact that you've wasted a bit of computation that you didn't really need in the "good" case.
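A tiny sketch of that intuition (notation loosely follows the lecture: `beta` is the decay rate, `s` is the running average of squared gradients; the gradient values are made up): once the accumulator has warmed up on a roughly constant gradient, the scaled gradient lands near ±1 regardless of the raw magnitude, so components that start out comparable stay comparable.

```python
import math

beta, eps = 0.9, 1e-8

def scaled_grad(grad, steps=50):
    """Feed a constant gradient through the RMSprop accumulator and
    return the final scaled gradient grad / (sqrt(s) + eps)."""
    s = 0.0
    for _ in range(steps):
        s = beta * s + (1 - beta) * grad ** 2
    return grad / (math.sqrt(s) + eps)

# Large and small components are both brought to roughly unit scale...
print(scaled_grad(10.0), scaled_grad(0.1))  # both close to 1.0
# ...so two components that were already comparable remain comparable.
print(abs(scaled_grad(1.0) - scaled_grad(1.2)) < 1e-6)  # True
```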

So if I understood correctly, RMSprop scales the components of the gradient so that they end up at a comparable scale, and Momentum averages the gradients over many iterations to reduce oscillations. Is that right?
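That distinction can be seen in a minimal numeric sketch (made-up oscillating gradient, `beta` as in the lectures): Momentum's signed average nearly cancels the oscillation, while RMSprop's average of squares normalizes the magnitude but keeps the alternating sign.

```python
import math

beta = 0.9
grads = [(-1.0) ** t for t in range(100)]  # gradient flipping sign each step

v = s = 0.0
for g in grads:
    v = beta * v + (1 - beta) * g        # Momentum: average of the gradients
    s = beta * s + (1 - beta) * g ** 2   # RMSprop: average of squared gradients

print(abs(v) < 0.1)              # True: the oscillations mostly cancel out
print(grads[-1] / math.sqrt(s))  # about -1.0: unit scale, sign unchanged
```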

Hi! Glad you asked this question, and I hope you are enjoying the course and Discourse. Remember to add the particular course in which you found this as a category, so that it's easier for other learners to refer to it as well.

That's a good point, Parth. I just moved this thread from "Uncategorized" to the "DLS Course 2" subcategory of the Deep Learning Specialization category. It's my first experience trying to move a thread and it was easy, but maybe the UI is not obvious at first glance: what you have to do is click the "Edit Pencil" on the title of the post, and that gives you the categorization as one of the things you can modify. We're all still learning how to use Discourse. It seems really good, and the more I learn, the better it gets!

There's a nice simulator here where you can play with different optimization algorithms. You can choose a point where the direction of steepest descent points to the minimum and compare the trajectories of SGD and RMSprop.

Keep in mind that the objective function seems a bit distorted because of the aspect ratio, and that the learning rate for SGD is twice that of RMSprop.

As others have mentioned, this scaling doesn't always make sense; whether it helps depends on the starting point.

Starting from point A as in the lecture notes, RMSprop does have the desired effect: compressing the step in the b direction and extending it in the w direction.

But what about when starting from point B? We do not want to reduce the step in the b direction, and we definitely do not want to introduce a step in the w direction (which is exactly what happens with the given formulas, since even a tiny dW gets scaled up). RMSprop actually introduces oscillations in this case.
Please explain.
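To make the concern concrete, here is a small sketch (made-up numbers; first RMSprop step from zero accumulators, so no bias correction): starting near the b-axis with a tiny noise component dW, the per-component scaling amplifies that tiny dW to the same scale as db.

```python
import math

beta, eps = 0.9, 1e-8
dW, db = 1e-3, 1.0  # db points toward the minimum; dW is just tiny noise

s_W = beta * 0.0 + (1 - beta) * dW ** 2  # accumulators start at zero
s_b = beta * 0.0 + (1 - beta) * db ** 2
step_W = dW / (math.sqrt(s_W) + eps)
step_b = db / (math.sqrt(s_b) + eps)

print(step_W, step_b)  # both about 3.16: the noise direction is amplified
```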

Thank you for the link, the app is indeed very interesting.
After playing with it for a while, I can see that there is much more going on than is discussed in the lectures. It's a shame that a course like this doesn't go deeper into the problem.

For example, it seems that the major lead RMSprop gets has, in most cases, nothing to do with rescaling the proportions of the individual gradient components (w and b, if still referring to the lecture).
Instead, the lead is achieved simply because the whole magnitude of the gradient vector is scaled up. In other words, the whole step taken in each iteration is larger.
A comparison of RMSprop to vanilla Gradient Descent where dW and db are each divided by norm([dW, db]) would make for a much fairer assessment, in my opinion.
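A sketch of that suggested baseline (made-up gradient; first RMSprop step from zero accumulators): dividing both components by norm([dW, db]) changes only the step length, whereas RMSprop's per-component scaling also changes the step direction.

```python
import math

beta, eps = 0.9, 1e-8
dW, db = 0.1, 2.0  # made-up gradient, mostly in the b direction

# Norm-scaled GD: magnitude changes, direction is preserved.
norm = math.hypot(dW, db)
gd = (dW / norm, db / norm)

# RMSprop, first step from zero accumulators: each component is divided
# by the square root of its own running average of squares.
rms = tuple(g / (math.sqrt((1 - beta) * g ** 2) + eps) for g in (dW, db))

def angle(v):
    return math.degrees(math.atan2(v[1], v[0]))

print(angle((dW, db)), angle(gd))  # identical: about 87.1 degrees
print(angle(rms))                  # 45 degrees: the direction has changed
```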