Checking Intuition: Gradient Descent with Momentum Advantage

In the Gradient Descent with Momentum lecture, Andrew uses a drastically simplified 2D contour plot to show how including momentum can help smooth out the zig-zags of gradient descent.

The conclusion was that this would allow you to speed up the gradient descent process.

There were two ways I thought this could help:

  1. Allow you to crank up the learning rate.
    This seemed like the option with the highest potential for improvement to me. By smoothing out the extreme back-and-forth along any one direction, you end up pointed in a generally more advantageous direction. That would let you increase the learning rate and take larger steps without worrying about overshooting and diverging.

  2. Slightly larger step sizes due to momentum
    A much smaller possible benefit I thought of: as you run gradient descent, the step sizes start to get smaller. By having a momentum term that incorporates the history, it seems like you would take slightly larger steps than you normally would (see the update sketch right after this list). However, the speed-up from this alone seems like it would be minimal.
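
For reference, here is roughly what the lecture's update looks like in NumPy. This is just a sketch with placeholder names and hyperparameter values, not the assignment code, but it shows how the history enters each step:

```python
import numpy as np

# Course-style momentum update (a sketch): v is an exponentially weighted
# average of past gradients, so each step's direction mixes in the history
# instead of using only the current gradient.
def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    v = beta * v + (1 - beta) * grad   # components that keep flipping sign get damped
    w = w - lr * v                     # components that consistently agree keep their size
    return w, v
```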


To summarize: until you take advantage of the 'smoothing' of the zig-zags by increasing the learning rate, it doesn't seem like you would actually speed up gradient descent. Otherwise, on average you would be moving about the same amount, just along a smoother path (aside from the slight speed-up from point 2).
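
To sanity-check that intuition, here is a toy numerical comparison on a 2D quadratic bowl with elongated contours. Everything here (the cost function, starting point, learning rates, and step count) is made up purely for illustration:

```python
import numpy as np

# Toy bowl with elongated contours: cost = 0.5 * (100*w1^2 + w2^2).
# Steep in w1 (the zig-zag direction), shallow in w2 (the direction we want to travel).
CURVATURE = np.array([100.0, 1.0])

def final_cost(lr, beta, steps=100):
    w = np.array([1.0, 10.0])             # start mostly out along the shallow axis
    v = np.zeros_like(w)
    for _ in range(steps):
        grad = CURVATURE * w              # gradient of the quadratic above
        v = beta * v + (1 - beta) * grad  # beta = 0 reduces this to plain GD
        w = w - lr * v
    return 0.5 * np.sum(CURVATURE * w ** 2)

print("plain GD, lr=0.019:", final_cost(lr=0.019, beta=0.0))
print("momentum, lr=0.019:", final_cost(lr=0.019, beta=0.9))
print("momentum, lr=0.19 :", final_cost(lr=0.19,  beta=0.9))
# Plain GD cannot use lr=0.19 here: the w1 update factor would be 1 - 0.19*100 = -18,
# so it diverges, while the momentum run stays stable.
```

On this toy problem, momentum at the same learning rate only helps modestly (a smoother path, roughly similar progress along the shallow direction), and the big win shows up once the learning rate is raised, which is exactly the point above.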

Does this make sense? Or am I missing other benefits that would speed up gradient descent?

Your statements sound correct to me. Here are a few additional thoughts on this general subject:

Yes, the pictures are highly simplified, almost to the point of absurdity. The problem is that our human brains just did not evolve to visualize things in more than 3 spatial dimensions. Note that the z axis there is the cost, so we have exactly two parameters in that geometric example. It is not unusual for NN models to have millions or even billions of parameters, so the dimensionality of the parameter space is so high that it's not at all clear that any intuitions we gain from 3D plots are that useful. Here's a paper about visualizing the complexity of solution surfaces. Here's another paper from Yann LeCun's group showing that even though the surfaces are incredibly complex and non-convex, getting trapped in poor local minima turns out not to be much of a practical problem.

We soon graduate to using TensorFlow for implementing everything, which means we no longer have to write low-level algorithms like momentum and GD in general. TF provides more sophisticated algorithms that don't use fixed learning rates, for example. The general point in all of this is that there is no "silver bullet" solution that works best in all cases. But in a lot of the examples we see throughout the rest of the DLS series, Adam Optimization seems to be one of the more commonly used algorithms. Prof Ng will discuss it later in this same week.
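
For example, once we are in TensorFlow, both of these are one-line optimizer choices (the learning-rate values below are just typical defaults, not recommendations):

```python
import tensorflow as tf

# SGD with the same kind of momentum term discussed above.
sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

# Adam combines momentum with per-parameter adaptive step sizes,
# which is part of why it shows up so often in the later DLS examples.
adam = tf.keras.optimizers.Adam(learning_rate=0.001)

# Typical usage when building a Keras model:
# model.compile(optimizer=adam, loss="categorical_crossentropy")
```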
