Why does RMSProp look at a moving average containing past gradients instead of just the current gradient?

Why does RMSProp look at a moving average containing past gradients instead of just the current gradient? If I’m understanding correctly, the idea of RMSProp is to prevent abnormally large gradients from sending gradient descent off course. In other words, to normalize the gradients.

If this is the case, then why do we need to look at the moving average? I understand how a moving average of the gradients is useful for momentum: it prevents drastic changes in the gradient descent trajectory. However, I don’t see how this makes sense for the normalization performed by RMSProp.

Does anyone have any insight into what RMSProp is actually doing with its moving average?

Also, why is the default value of the Beta 2 hyperparameter used with RMSProp so close to 1 (it’s presented as 0.999)? Wouldn’t the moving average basically do nothing with such a large value?

Sorry for the slow response here, but I think the best idea would be for you to go back and watch the lectures again. You might want to start with at least the last one about understanding exponentially weighted averages (EWAs). Then step through Momentum and RMSProp, which are two similar ways of using EWAs to make Gradient Descent converge more quickly. Prof Ng answers all the questions you posed in the lectures.

RMSProp is not about “normalization” in the standard sense of that term in ML, or about dealing with vanishing or exploding gradients: it’s about reducing the stochastic noise in the gradients by using an EWA. The whole point is to smooth the behavior by averaging over the relatively recent gradient behavior, which is exactly what an EWA provides (in a tunable way). Please watch the lectures again with that idea in mind and I think it will make more sense the second time through. Prof Ng does a much better job explaining this at the whiteboard than I can hope to do by trying to type what he said.
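To make the smoothing idea concrete, here is a minimal sketch of an exponentially weighted average in plain Python. The function name `ewa`, the noisy test sequence, and the choice of beta = 0.9 are illustrative assumptions, not something taken from the lectures.

```python
import random
import statistics


def ewa(values, beta):
    """Exponentially weighted average: v_t = beta * v_{t-1} + (1 - beta) * x_t."""
    v = 0.0  # the running average starts at zero (no bias correction here)
    smoothed = []
    for x in values:
        v = beta * v + (1.0 - beta) * x
        smoothed.append(v)
    return smoothed


# Hypothetical noisy signal: a constant 1.0 plus Gaussian noise.
random.seed(0)
noisy = [1.0 + random.gauss(0.0, 1.0) for _ in range(2000)]
smoothed = ewa(noisy, beta=0.9)

# Compare the spread of the raw and smoothed values once the average has warmed up.
print("raw spread:     ", statistics.pstdev(noisy[1000:]))
print("smoothed spread:", statistics.pstdev(smoothed[1000:]))
```

With beta = 0.9 the average effectively looks back over roughly the last 1 / (1 - beta) = 10 values, so the smoothed series should fluctuate noticeably less than the raw one.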

As for the question about $\beta_2$, take another look at the formulas. Are you talking about Adam or RMSprop? In the Adam case, note that it uses a bias-correction factor of $\displaystyle \frac{1}{1 - \beta_2^t}$. Try computing $0.999^{10}$, $0.999^{100}$ and $0.999^{1000}$ and see what you get. Note that convergence often takes tens of thousands of iterations.
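The arithmetic suggested above can be checked with a few lines of Python. This is only a sketch: the helper name `bias_correction` is made up for illustration, and the extra step count of 10000 is added as an assumption to show the long-run behavior.

```python
beta2 = 0.999


def bias_correction(beta, t):
    """Return (beta**t, 1 / (1 - beta**t)) for iteration t (t >= 1)."""
    decay = beta ** t
    return decay, 1.0 / (1.0 - decay)


for t in (10, 100, 1000, 10000):
    decay, factor = bias_correction(beta2, t)
    print(f"t={t:>5}  beta2^t={decay:.4f}  1/(1 - beta2^t)={factor:.3f}")
```

The factor is large in the first few iterations and approaches 1 as t grows, so the correction mostly matters early in training.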

Thanks for your response and for explaining that RMSProp is about reducing stochastic noise in the gradients. I’ve closely analyzed RMSProp and rewatched the videos, and I now understand what RMSProp is doing.

However, I don’t understand why it is doing it.

In this slide Andrew is explaining that, in this example, we want to minimize movement in the vertical direction but make it larger in the horizontal direction. I agree that this is what RMSProp will do here. But how do we know that the direction in which we actually want to step has a smaller root mean square gradient? It actually seems more intuitive to me that over time the horizontal direction would have a larger average gradient, since it is the direction in which we are generally going. In that case RMSProp would actually slow down how fast the model progresses.
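One way to probe this is a small experiment. The sketch below assumes a hypothetical two-parameter quadratic loss with a shallow "horizontal" direction and a steep "vertical" one, and tracks the running average of squared gradients that RMSProp keeps for each parameter. The curvatures (1 and 100), the learning rate, the starting point, and beta2 = 0.9 are all made-up illustrative values, not anything from the slide.

```python
import math


def rmsprop_quadratic(steps=50, lr=0.01, beta2=0.9, eps=1e-8):
    """Run RMSProp on the toy loss f(w, b) = 0.5 * (1.0 * w**2 + 100.0 * b**2).

    Index 0 is the shallow ("horizontal") parameter w,
    index 1 is the steep ("vertical") parameter b.
    Returns the final parameters and the final running averages of squared gradients.
    """
    curvature = (1.0, 100.0)   # assumed curvatures of the toy loss
    x = [1.0, 1.0]             # assumed starting point
    s = [0.0, 0.0]             # running average of squared gradients, one per parameter
    for _ in range(steps):
        grad = [curvature[i] * x[i] for i in range(2)]
        for i in range(2):
            s[i] = beta2 * s[i] + (1.0 - beta2) * grad[i] ** 2
            x[i] -= lr * grad[i] / (math.sqrt(s[i]) + eps)
    return x, s


params, sq_avg = rmsprop_quadratic()
print("final parameters (w, b):        ", params)
print("running avg of grad^2 (w, b):   ", sq_avg)
```

In this toy setting it is the steep direction that accumulates the much larger running average, so its gradient gets divided by a larger number. It is only a two-dimensional illustration under the stated assumptions, so it does not settle how this plays out in a real network.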

Why would that specific parameter of the model even oscillate so vigorously and predictably?

What makes more sense to me is that RMSProp is basically a normalization of the gradients, which brings all gradients down to a more similar scale. That is what this person says in their article:
Day 69: rmsprop. What about some machine learning… | by Tomáš Bouda | 100 days of algorithms | Medium.

Is this wrong?

If you are not satisfied by what Prof Ng says about RMSprop, here’s a lecture by Prof Geoff Hinton that covers similar material. I have not heard of Tomáš Bouda before, but Geoff Hinton is a world-recognized expert in all things ML/AI, and Bouda also refers to Hinton’s lecture in the article.

The other general comment to make is that the diagram Prof Ng shows to motivate this discussion is highly simplified in that it only shows two dimensions, and the same is true of the ones in Hinton’s lecture. It’s hard to visualize what is really happening in cases like this because we are operating in parameter space, which typically has at least hundreds of dimensions and frequently several orders of magnitude more than that. The human brain has only evolved to visualize things in 3 dimensions, of course, so it can be non-intuitive how the math works out in much higher dimensional spaces.