Does RMSprop slow down learning when we should take larger steps in a direction?

In the course we learn that RMSprop reduces oscillations by scaling down the steps in directions with large oscillation. It does this by scaling down learning in directions with large gradients. But what if the large gradients are due to a genuinely steep slope in that particular dimension rather than oscillation? Wouldn’t RMSprop then hurt learning?

For example, in the graph below, no matter where we start out, the squared gradients in the b dimension are always larger than those in the w dimension, so learning in the b direction always gets scaled down more.
If we start out at A, RMSprop reduces the gradient more in the b dimension, which lets us use a larger learning rate and speeds up learning in the w dimension.
If we start out at B, RMSprop would again reduce the gradient more in the b dimension. But in this scenario that large b gradient points along a steep descent direction, not an oscillation, so it seems to slow down learning.
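To make the scenario concrete, here is a rough sketch with made-up gradient values (the numbers and variable names are mine, not from the course). Starting from s = 0, one RMSprop-style update divides each dimension's step by the root of the moving average of its squared gradient:

```python
# Hypothetical numbers: the gradient in b is 20x steeper than in w.
import math

lr, beta, eps = 0.01, 0.9, 1e-8
grads = {"w": 0.1, "b": 2.0}  # assumed gradients at the current point

steps = {}
for k, g in grads.items():
    s = beta * 0.0 + (1 - beta) * g * g          # EMA of squared gradient (s starts at 0)
    steps[k] = lr * g / (math.sqrt(s) + eps)     # step scaled by 1/sqrt(s)

# Plain gradient-descent steps would be lr*g: 0.001 for w vs 0.02 for b
# (a 20x ratio). The RMSprop steps come out nearly equal, because each
# gradient is divided by (roughly) its own recent magnitude.
```

So RMSprop doesn't just shrink the b step; it roughly equalizes the step sizes across dimensions, which is why the question of starting at B matters.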

RMSprop automatically adjusts the effective learning rate in each dimension based on an exponential moving average of the squared gradients. The effective learning rate stays higher on a flat surface (no sudden changes in the gradients).
It is reduced when the gradients are large or change suddenly. This does not hurt learning overall; it helps by damping oscillations during training.
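Here is a minimal sketch of the idea (hyperparameter names like `beta` and `eps` are my own, not necessarily the course's notation). The "effective learning rate" in each dimension is lr / (sqrt(s) + eps), where s is the exponential moving average of that dimension's squared gradient:

```python
import math

def rmsprop(grad_fn, x, lr=0.01, beta=0.9, eps=1e-8, n_steps=100):
    """Run RMSprop on a parameter vector x given a gradient function."""
    s = [0.0] * len(x)
    for _ in range(n_steps):
        g = grad_fn(x)
        for i in range(len(x)):
            s[i] = beta * s[i] + (1 - beta) * g[i] * g[i]   # EMA of squared gradient
            x[i] -= lr * g[i] / (math.sqrt(s[i]) + eps)      # effective lr = lr/sqrt(s)
    return x, s

# Toy bowl-shaped cost J(w, b) = 0.5*w**2 + 10*b**2, minimum at (0, 0):
# the b gradient is 20x larger, so s for b grows much larger and b's
# effective learning rate shrinks accordingly.
x, s = rmsprop(lambda x: [x[0], 20.0 * x[1]], [5.0, 5.0])
```

Note that even though the b gradient is always 20x larger, both coordinates move toward the minimum at a similar pace, because each step is divided by the RMS of that dimension's own recent gradients. Starting from B changes the gradient values but not this normalizing behavior.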

Hope this clarifies your doubt.

What is the ‘effective learning rate’? And can you explain in more detail how the algorithm behaves if we start from point B?