Gradient descent with momentum

In the video on gradient descent with momentum, Prof. Andrew Ng suggested using this method to overcome the zig-zag movement of the steps. But didn't we already solve this problem when we talked about normalization? We said that normalization can center the data, so the contour plot of the cost function will look circular rather than elliptical; as a result we avoid the zig-zag movement and the algorithm gets faster. What benefit will gradient descent with momentum add if I have already centered the data using normalization?
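(For reference, by normalization I mean the usual per-feature standardization from the course; here's a minimal sketch with a made-up toy matrix `X`, just to pin down what I'm asking about:)

```python
import numpy as np

# Hypothetical toy design matrix: m=3 examples, n=2 features on very
# different scales (which is what causes the elongated contours).
X = np.array([[200.0, 1.0],
              [180.0, 3.0],
              [220.0, 2.0]])

# Standardize each feature column to zero mean and unit variance.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_norm = (X - mu) / sigma
```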

You can still have zig-zag problems even after normalization. The 3D pictures they show of the cost surfaces are pretty unrealistic. We are dealing with very high-dimensional spaces here (as many dimensions as there are parameters), so the surfaces can be pretty gnarly even with normalization and are completely impossible to visualize. Our human brains are just not set up to visualize more than 3 dimensions. Here's a paper from Yann LeCun's group about visualizing cost surfaces.

Just to elaborate a bit: notice that we are plotting the cost function J, which is a function of all the W^{[l]} and b^{[l]} values. J is a scalar output, of course. So when Prof. Ng shows a plot in 3D, the z axis there is the cost J and the x and y axes are the input parameters, which means that picture shows how it would work with literally two parameters. That's right: just two. So you've got one layer with a scalar w and b. So while the pictures may help with the intuition, they are showing a radically simpler case than what we are actually dealing with.
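To make the momentum point concrete, here's a minimal sketch of the update rule from the course (v = βv + (1−β)dW, then W ← W − αv), run on a hypothetical 2-parameter quadratic bowl with elongated, ellipse-shaped contours. The matrix `A` and all the hyperparameter values are made up purely for illustration:

```python
import numpy as np

def gd_momentum_step(w, grad, v, lr=0.03, beta=0.9):
    """One gradient-descent-with-momentum update, as in the course:
    v = beta*v + (1-beta)*grad;  w = w - lr*v."""
    v = beta * v + (1 - beta) * grad
    w = w - lr * v
    return w, v

# Toy cost J(w) = 0.5 * w @ A @ w. The mismatched diagonal entries
# make the contours elliptical, which is what causes zig-zagging.
A = np.diag([1.0, 25.0])
grad = lambda w: A @ w

w_plain = np.array([1.0, 1.0])   # plain gradient descent
w_mom = np.array([1.0, 1.0])     # gradient descent with momentum
v = np.zeros(2)
for _ in range(200):
    w_plain = w_plain - 0.03 * grad(w_plain)
    w_mom, v = gd_momentum_step(w_mom, grad(w_mom), v)
```

The exponential moving average in `v` smooths out the component of the gradient that flips sign across the narrow axis of the ellipse, while the component along the long axis keeps accumulating, so both runs reach the minimum but momentum damps the oscillation along the steep direction.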


Thank you, sir.

Hi, are momentum and its variations ever useful when the full set of examples is used?