Can RMSprop go wrong?

Hi again!

I would appreciate it if somebody could help me verify my understanding of RMSprop:

If we have a problem with a well-behaved gradient that mostly points in the correct direction (without much unwanted fluctuation), then RMSprop will actually slow down convergence, not speed it up. Is that correct? If not, can you tell me where I am getting this wrong? If yes, then why does it work in practice? Perhaps because most practical problems are of the problematic kind? Maybe high-dimensional spaces cause these bad cases?

Thank you!
Mohammad

Hi @ajallooe,

Here is the lecture video for RMSProp.

Here is the comparison between (1) the vanilla gradient descent formula and (2) the RMSProp formula:

  1. w := w - \alpha g_t

  2. w := w - \alpha g_t \times \frac{1}{\sqrt{v_t}+\epsilon}

where

g_t = \left({\frac{\partial{J}}{\partial{w}}}\right)_t is the gradient at time step t, and
v_t = \beta v_{t-1} + (1-\beta)\, g_t^2 is a running (exponentially weighted) average of the squared gradient, i.e. the squared "deviation" term, at time step t.
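
For concreteness, here is a minimal sketch of the two updates in NumPy (the names lr, beta, and eps are my own, not lecture notation):

```python
import numpy as np

def vanilla_gd_step(w, g, lr=0.01):
    """(1) Plain gradient descent: w := w - alpha * g_t."""
    return w - lr * g

def rmsprop_step(w, g, v, lr=0.01, beta=0.9, eps=1e-8):
    """(2) RMSProp: maintain a running average v of the squared gradient
    and divide the step by its square root."""
    v = beta * v + (1 - beta) * g**2          # update v_t
    w = w - lr * g / (np.sqrt(v) + eps)       # per-parameter scaled step
    return w, v
```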

Based on the lecture material, can you explain how you came to the conclusion that RMSProp is about direction?
Is there any other message from the lecture video that you could use to explain to us what RMSProp is about?

  1. Please take your time thinking and experimenting. If you are not sure about this topic, don't rush; take this as a second chance to approach it.

  2. Please feel free to change your mind, but whether you change it or not, explain how you reached your latest conclusion based on the lecture materials.

Raymond


Hi Raymond,

Thank you for your reply. This is the diagram Andrew uses to explain RMSprop:

[Image: the contour-plot diagram from the lecture, where gradient descent oscillates up and down across the elongated contours while moving only slowly toward the minimum]

Notice how the direction of unwanted oscillation has a greater magnitude than the "forward" direction toward the destination point. The way Andrew explains it (see the video from 2:40 to 4:40), RMSprop works because it dampens large movements. Unlike GD with momentum, RMSprop does not care whether successive movements oppose each other: with momentum, positive and negative movements cancel each other, resulting in less buildup of 'momentum', whereas with RMSprop, since the gradients are squared, only the magnitude matters, not the sign. And it is a dampening scheme, because the update is divided by \sqrt{S_{d \mathbf{\Theta}}}, so gradient directions with bigger magnitudes get dampened more. The diagram above is an example of the bad case where RMSprop, in the way Andrew explains the concept, makes sense.
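
To put rough numbers on it (a toy two-parameter example of my own, not from the lecture):

```python
import numpy as np

# Toy gradient: the first component is the large, oscillating direction,
# the second is the small "forward" direction toward the minimum.
g = np.array([10.0, 1.0])
S = np.array([100.0, 1.0])   # running average of g**2 (here simply g**2)
eps = 1e-8

scaled = g / (np.sqrt(S) + eps)
print(scaled)   # roughly [1., 1.]: the large oscillating component is
                # dampened down to the same scale as the small forward one
```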

Now, I am assuming you can also have gradient directions like this:

[Image: a sketch of contours where the larger gradient component points along the correct direction toward the minimum]

…where bigger movements happen in the correct directions. In such a case, RMSprop slows down convergence by dampening big movements.

Mohammad

I figured it out. What I was overlooking is that the second case cannot actually happen. The situation is always like the one Andrew drew in that diagram, because the gradient is always perpendicular to the contour lines, by the definitions of the gradient and of a contour plot.
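
To spell that step out: along a contour line J is constant, so for any infinitesimal displacement d\mathbf{w} along the contour,

dJ = \nabla J \cdot d\mathbf{w} = 0,

which means \nabla J has no component along the contour; it points straight across the contour lines, exactly as in Andrew's picture.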

Anyway, thank you Raymond.

Hello @ajallooe,

I think developing an understanding of gradient descent is crucial, and it also takes time, so this discussion is certainly not the end of it. With that in mind, here are some things I believe are relevant to your journey with GD.

To begin with, if you search (Ctrl + F) the Transcript of the video for the keyword “oscillation”, it appears 7 times, and every occurrence points to a message like this one:

So that’s RMSprop, and similar to momentum, has the effects of damping out the oscillations in gradient descent, in mini-batch gradient descent

I draw your attention to this because, even though you mentioned that RMSProp squares things to dampen oscillation, it seemed to me that you saw this as a limitation and considered it a downside. I agreed with your description of the signs, the squaring, and the fact that it cares about magnitude instead of direction, but the denominator term in RMSProp does help!

Momentum, however, can do something that the denominator (deviation) term cannot, which is to shoot over those bumpy saddle points; that is the benefit of keeping to one direction. It is a division of labour: each of them has a role to play here. Now, did you consider the different cases to see how they might act to help (see the sketch after the list below)?

  1. gradient term is larger than the deviation term
  2. deviation term is larger than the gradient term
  3. both terms have a similar scale
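
Here is a quick scalar sketch of how the RMSProp step \alpha\, g_t / (\sqrt{v_t}+\epsilon) behaves in those three cases (the numbers are made up purely for illustration):

```python
import numpy as np

def rmsprop_step_size(g, v, lr=0.01, eps=1e-8):
    """Magnitude of the RMSProp step lr * g / (sqrt(v) + eps) for scalar g and v."""
    return lr * g / (np.sqrt(v) + eps)

lr = 0.01
# 1. gradient term larger than the deviation term: the step grows beyond the learning rate
print(rmsprop_step_size(g=1.0, v=0.01, lr=lr))   # ~0.1, ten times lr
# 2. deviation term larger than the gradient term: the step shrinks below the learning rate
print(rmsprop_step_size(g=0.1, v=1.0, lr=lr))    # ~0.001, a tenth of lr
# 3. both terms on a similar scale: the step is roughly the learning rate itself
print(rmsprop_step_size(g=1.0, v=1.0, lr=lr))    # ~0.01
```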

As for whether the denominator term can slow things down: I would not be surprised if it slowed things down somewhat in the early stage of training, BUT I would be more impressed by the speedy convergence that the term brings overall. For this, I think we need experiments.
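
If you would like to try one, here is a minimal sketch of such an experiment (a toy setup of my own, not from the course): plain gradient descent versus RMSProp on an elongated quadratic bowl J(\mathbf{w}) = 0.5\,(100 w_1^2 + w_2^2), whose minimum is at the origin.

```python
import numpy as np

# Gradient of the toy cost J(w) = 0.5 * (100 * w[0]**2 + w[1]**2)
grad = lambda w: np.array([100.0 * w[0], w[1]])

def run(use_rmsprop, lr=0.01, beta=0.9, eps=1e-8, steps=200):
    w = np.array([1.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        g = grad(w)
        if use_rmsprop:
            v = beta * v + (1 - beta) * g**2      # running average of squared gradients
            w = w - lr * g / (np.sqrt(v) + eps)   # per-coordinate scaled step
        else:
            w = w - lr * g                        # plain gradient descent
    return w

# Distance from the minimum after the same number of steps and learning rate:
print("plain GD :", np.linalg.norm(run(use_rmsprop=False)))  # the shallow w[1] direction still lags
print("RMSProp  :", np.linalg.norm(run(use_rmsprop=True)))   # both directions shrink at a similar pace
```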

Is it really a wrong thing, as your title asks, to lag a little bit at first but improve the overall training?

Lastly, I have written this, which looks at Adam from the angle of signal-to-noise ratio; that is not my original idea, as the Adam paper itself describes it that way. I hope it will be a useful angle for you.
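
In case it helps to see where that angle comes from, here is a minimal sketch of the Adam update (the standard formulas, with my own variable names); the ratio m_hat / sqrt(v_hat) is the quantity the Adam paper interprets as a signal-to-noise ratio.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters w, given gradient g at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g        # first moment: momentum-like "signal"
    v = beta2 * v + (1 - beta2) * g**2     # second moment: running average of g**2
    m_hat = m / (1 - beta1**t)             # bias correction for the first moment
    v_hat = v / (1 - beta2**t)             # bias correction for the second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # step scales with the SNR-like ratio
    return w, m, v
```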

Cheers,
Raymond