In the Gradient Descent with Momentum lecture, Andrew uses a drastically simplified 2D contour plot to show how including momentum can help smooth out the zig-zags of gradient descent.
The conclusion was that this would let you speed up the gradient descent process.
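For concreteness, here is a minimal numpy sketch of the update as I understand it from the lecture (the exponentially weighted average form, without bias correction); the function and variable names are just my own:

```python
import numpy as np

def momentum_update(W, dW, v, learning_rate=0.01, beta=0.9):
    """One step of gradient descent with momentum.

    v is an exponentially weighted average of past gradients: components of
    dW that flip sign from step to step (the zig-zags) largely cancel out,
    while components that consistently point toward the minimum are kept.
    """
    v = beta * v + (1 - beta) * dW   # smooth the gradient history
    W = W - learning_rate * v        # step along the smoothed direction
    return W, v

# tiny usage example with made-up numbers
W, v = np.array([1.0, 2.0]), np.zeros(2)
dW = np.array([0.5, -0.3])           # pretend gradient, for illustration only
W, v = momentum_update(W, dW, v)
```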
There were two possible ways I thought that this could help:
- Allow you to crank up the learning rate.
  This seemed to me like the option with the highest potential for speeding things up. By smoothing out the extreme back-and-forth along any one direction, you end up pointed in a more advantageous direction overall. That would let you increase the learning rate and take larger steps without worrying about overshooting and diverging (see the quick numerical check after this list).
- Slightly larger step sizes due to momentum.
  A much smaller possible benefit I thought of: it seems like as you run gradient descent the step sizes start to get smaller. By having a momentum term that incorporates the gradient history, it seems like the steps would be slightly larger than they would be otherwise. However, the advantage here in speeding up learning seems minimal on its own.
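To sanity-check point 1, here is a small toy experiment I put together (my own construction, not from the course): a quadratic bowl with very elongated contours, f(w) = 0.5 * (w1^2 + 25 * w2^2). At a learning rate where plain gradient descent oscillates with growing amplitude along the steep w2 direction, the same learning rate with momentum still converges:

```python
import numpy as np

def grad(w):
    # gradient of f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2),
    # a bowl with very different curvature along the two axes
    return np.array([w[0], 25.0 * w[1]])

def run(learning_rate, beta, steps=200):
    w = np.array([1.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(w)   # beta=0 reduces to plain gradient descent
        w = w - learning_rate * v
    return w

print(run(learning_rate=0.1, beta=0.0))   # plain GD: w[1] has blown up to ~1e35
print(run(learning_rate=0.1, beta=0.9))   # momentum: both coordinates end up near 0
```

At least in this toy case, the smoothing is exactly what makes the larger learning rate usable.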
To summarize: unless you take advantage of the smoothed-out zig-zags by increasing the learning rate, it doesn't seem like you would actually speed up gradient descent. Otherwise, on average you would be moving about the same amount per step, just along a smoother path (aside from the slight speed-up from point 2).
Does this make sense? Or am I missing other ways that momentum would speed up gradient descent?