RMSprop was described as a way to speed up Gradient Descent.
However, the example described seems to be a special case.
As presented, RMSprop has the effect of scaling the magnitudes of the gradient descent steps to be closer to each other across dimensions. It looks more like a normalization approach.
Andrew picked a case where:
The gradient magnitude in b was relatively high, and b was the more incorrect direction
The gradient magnitude in w was relatively low, and w was the more correct direction
The higher-magnitude step in b becomes relatively smaller (better)
The lower-magnitude step in w becomes relatively larger (better)
If we then increase the learning rate, we improve performance further
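To make the scaling effect concrete, here is a minimal NumPy sketch of the RMSprop update. The gradient magnitudes (10.0 for b, 0.1 for w) are illustrative numbers of my own, not from the lecture:

```python
import numpy as np

def rmsprop_step(param, grad, cache, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update: divide the gradient by a running RMS of its history."""
    cache = beta * cache + (1 - beta) * grad ** 2
    step = lr * grad / (np.sqrt(cache) + eps)
    return param - step, cache, step

# Andrew's style of case: b's gradient is large, w's is small.
grad_b, grad_w = 10.0, 0.1
b, w = 1.0, 1.0
cache_b = cache_w = 0.0
for _ in range(100):  # with steady gradients, each cache converges to grad**2
    b, cache_b, step_b = rmsprop_step(b, grad_b, cache_b)
    w, cache_w, step_w = rmsprop_step(w, grad_w, cache_w)

# After warm-up each step is roughly lr * sign(grad):
# the 100x magnitude gap between the two dimensions disappears.
print(abs(step_b), abs(step_w))
```

Since the denominator converges to the gradient's own magnitude, the effective step in every dimension is pushed toward the same size, which is the normalization behavior described above.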
It seems like we could easily construct a counter case:
In this example:
The gradient magnitude in b is relatively low, and b is the more incorrect direction
The gradient magnitude in w is relatively high, and w is the more correct direction
In this case it seems we would get the opposite effect:
The lower-magnitude step in b (the incorrect direction) becomes relatively larger (worse)
The higher-magnitude step in w (the correct direction) becomes relatively smaller (worse)
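Working the same update rule on the swapped case (again with illustrative numbers), RMSprop still drives both step sizes toward lr, so the step in the incorrect b direction grows relative to the step in the correct w direction:

```python
import numpy as np

# Counter case: b's gradient is small but b is the more incorrect direction,
# w's gradient is large and w is the more correct direction.
grad_b, grad_w = 0.1, 10.0
cache_b = cache_w = 0.0
for _ in range(100):  # steady gradients; caches converge to grad**2
    cache_b = 0.9 * cache_b + 0.1 * grad_b ** 2
    cache_w = 0.9 * cache_w + 0.1 * grad_w ** 2

lr = 0.01
step_b = lr * grad_b / (np.sqrt(cache_b) + 1e-8)
step_w = lr * grad_w / (np.sqrt(cache_w) + 1e-8)

# Plain gradient descent would take steps of 0.001 and 0.1 (ratio 1/100).
# RMSprop equalizes them (ratio near 1), amplifying the incorrect direction.
print(step_b / step_w)
```

The rule is symmetric: it cannot tell whether the large gradient was the helpful one or the harmful one.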
Is there any reason to expect the case Andrew presented, relatively larger gradient magnitudes in the relatively incorrect directions, to be the norm?
If not, it would seem that RMSprop is good at damping extreme incorrect magnitudes to help prevent overshooting via normalization. Wouldn't this also have the disadvantage of damping extreme steps in good directions?
If that is the case, wouldn't a momentum approach be a better overall solution? Instead of trying to 'normalize' step sizes (both good and bad), momentum seems to actually cancel incorrect step components and leave only the steps in the relatively correct direction.
Once we are pointed in the correct direction, it seems like momentum would allow us to crank up the learning rate and get the most advantageous results.
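A minimal sketch of the momentum idea, using made-up gradients: one component's gradient flips sign every step (oscillation/overshooting), the other's is small but consistent. The exponentially weighted average damps the first and preserves the second:

```python
def momentum_step(vel, grad, beta=0.9):
    """Update the exponentially weighted average of gradients (the velocity)."""
    return beta * vel + (1 - beta) * grad

# b's gradient oscillates (overshooting back and forth); w's is consistent.
vel_b = vel_w = 0.0
for t in range(100):
    grad_b = 1.0 if t % 2 == 0 else -1.0  # sign flips every step
    grad_w = 0.1                          # always points the same way
    vel_b = momentum_step(vel_b, grad_b)
    vel_w = momentum_step(vel_w, grad_w)

# The oscillating component is averaged toward zero, while the consistent
# component converges to its true gradient value.
print(abs(vel_b), vel_w)
```

So unlike RMSprop's magnitude normalization, momentum cancels based on directional consistency, which is the distinction the question is getting at.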