Why do we need momentum when data is normalized during preprocessing in ML or DL?

I believe such contour plots arise when the data is not normalized, but we normalize the data every time in ML or DL (please correct me if I am wrong). If that is the case, why would we need “Gradient Descent with Momentum” when such contours can be fixed by normalizing the data?

Or is it that normalization does not fix everything? In higher dimensions we may not get a perfectly shaped loss surface: even with normalized data there could be steep or flat regions along different dimensions, or saddle points, resulting in uneven learning, so we still need momentum. Please help me understand.

Yes, even with normalization, if “your leap” is bigger than needed at a certain moment, you might overshoot. In a high-dimensional space, the loss landscape is not so simple.

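Momentum does not tune the learning rate itself. It smooths the update direction by accumulating past gradients with exponentially decaying weights, which damps oscillations and speeds up progress along directions where the gradient is consistent. It works independently of the scaling of the features.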
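To make the update rule concrete, here is a minimal sketch in plain Python. It assumes a toy quadratic loss with uneven curvature (the `curvatures`, starting point, learning rate, and `beta` are illustrative values, not from the course) and uses the classical form `v = beta * v + grad`. Some courses write it as an exponentially weighted average with a `(1 - beta)` factor on the gradient, which differs only by a rescaling of the learning rate.

```python
def grad(w, curvatures):
    # Gradient of the toy loss f(w) = 0.5 * sum(c_i * w_i**2).
    return [c * wi for c, wi in zip(curvatures, w)]


def steps_to_converge(w0, curvatures, lr, beta, tol=1e-3, max_steps=10_000):
    """Run gradient descent with momentum; beta=0.0 gives plain gradient descent."""
    w = list(w0)
    v = [0.0] * len(w)  # velocity: decaying accumulation of past gradients
    for step in range(1, max_steps + 1):
        g = grad(w, curvatures)
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        w = [wi - lr * vi for wi, vi in zip(w, v)]
        if sum(wi * wi for wi in w) ** 0.5 < tol:
            return step
    return max_steps


# Assumed toy setup: one shallow direction and one steep direction.
curvatures = [1.0, 25.0]
w0 = [10.0, 1.0]

plain_steps = steps_to_converge(w0, curvatures, lr=0.03, beta=0.0)
momentum_steps = steps_to_converge(w0, curvatures, lr=0.03, beta=0.9)
print("plain GD steps:", plain_steps)
print("momentum steps:", momentum_steps)
```

With the same learning rate, the momentum run should need noticeably fewer steps here, because the velocity keeps building up along the shallow direction while the steep direction stays stable.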

There are 2 points to consider:

  1. Speed of convergence
  2. Number of local minima

When you’re using a loss function (say, binary cross-entropy) and a simple model architecture (say, logistic regression) where there’s only one local minimum (which is also the global minimum) and no ‘plateau’, the advantages of using momentum are minimal: only #1 is relevant. However, when you use a more complex network structure and a loss function with multiple local minima and/or a ‘plateau’, momentum helps the model learn faster and may converge to a ‘better’ local optimum.

Hello @munish259272, I just wanted to add that this particular contour is a very special case, e.g. when we have a linear model. Such a contour, as you said, may be reshaped into something better by normalization. However, when we have a non-linear model, which is the case momentum is meant to help with, the contour will, as you said again, become very complex.

Normalization does not “fix” the complexity brought by the model’s non-linearity. The contour on the slide serves as a simple example of how momentum works, but it probably does not show everything momentum can overcome.
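As a rough illustration of that point, the sketch below uses a hypothetical two-parameter model, `pred = w2 * tanh(w1 * x)`, with inputs that are already standardized (mean 0, variance 1). The model, data, and targets are all assumptions made up for this example. Even so, the gradient vanishes at `w1 = w2 = 0`, which is a saddle point of this loss, while it is clearly non-zero elsewhere, so the flat region comes from the model's non-linearity rather than from feature scaling.

```python
import math

# Assumed toy data: standardized inputs (mean 0, variance 1) and made-up targets.
xs = [-math.sqrt(1.5), 0.0, math.sqrt(1.5)]
ys = [2.0 * math.tanh(x) for x in xs]


def grad_norm(w1, w2):
    """Norm of the gradient of the mean squared error for pred = w2 * tanh(w1 * x)."""
    g1 = g2 = 0.0
    for x, y in zip(xs, ys):
        h = math.tanh(w1 * x)
        err = w2 * h - y
        g1 += 2.0 * err * w2 * (1.0 - h * h) * x  # d(loss)/d(w1)
        g2 += 2.0 * err * h                       # d(loss)/d(w2)
    n = len(xs)
    return math.hypot(g1 / n, g2 / n)


print("gradient norm at (0, 0):", grad_norm(0.0, 0.0))
print("gradient norm at (1, 1):", grad_norm(1.0, 1.0))
```

Plain gradient descent started very close to the origin would take tiny steps there, which is the kind of situation where accumulated velocity can help.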

Cheers,
Raymond