Why do we need momentum when data is normalized during preprocessing in ML or DL?

munish259272 · November 25, 2024, 1:59pm

I believe contour plots like this arise when the data is not normalized, but we normalize the data every time in ML or DL (please correct me if I am wrong). If so, why would we need “Gradient Descent with Momentum”? If normalization already fixes such contours, momentum would seem unnecessary.

Or is it that normalization does not fix everything? In higher dimensions the loss surface may not be a perfect bowl: even with normalized data there can be steep and flat regions along different dimensions, as well as saddle points, all of which lead to uneven learning. Is that why we still apply momentum to normalized data? Please help me understand.

gent.spah · November 25, 2024, 2:17pm

Yes, even with normalization, if “your leap” is bigger than needed at a certain moment, you might overshoot. In a high-dimensional space, the relief of the loss surface is not that simple.

TMosh · November 25, 2024, 4:20pm

Momentum smooths the gradient updates during training, which helps the optimizer make steadier progress at a given learning rate. It works independently of how the features are scaled.
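A minimal sketch of why scaling alone may not remove the need for momentum. It assumes a linear model with MSE loss and two standardized features whose correlation is `RHO`; in that setting the Hessian is the correlation matrix `[[1, RHO], [RHO, 1]]`, so the contours stay elongated even though both features have unit variance. The values of `RHO`, `lr`, `beta` and the starting point are hypothetical, and the update shown is the classic "heavy ball" form `v = beta * v + grad` (the course variant that scales the gradient by `1 - beta` differs only in the effective step size).

```python
def grad(w, rho):
    """Gradient of f(w) = 0.5 * w^T H w with H = [[1, rho], [rho, 1]]."""
    return [w[0] + rho * w[1], rho * w[0] + w[1]]


def loss(w, rho):
    """Quadratic loss f(w) = 0.5 * w^T H w (minimum value 0 at w = 0)."""
    g = grad(w, rho)
    return 0.5 * (w[0] * g[0] + w[1] * g[1])


def run(rho, lr, beta, steps, w0=(1.0, 0.0)):
    """Gradient descent with heavy-ball momentum; beta = 0 is plain GD."""
    w = list(w0)
    v = [0.0, 0.0]  # velocity: decaying sum of past gradients
    for _ in range(steps):
        g = grad(w, rho)
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        w = [wi - lr * vi for wi, vi in zip(w, v)]
    return loss(w, rho)


RHO = 0.98  # assumed correlation between the two standardized features

plain_loss = run(RHO, lr=0.9, beta=0.0, steps=200)
momentum_loss = run(RHO, lr=0.9, beta=0.9, steps=200)

print(f"plain GD loss after 200 steps: {plain_loss:.3e}")
print(f"momentum loss after 200 steps: {momentum_loss:.3e}")
```

With these assumed settings, the momentum run should end at a much lower loss than plain gradient descent, because the curvature along the two eigen-directions differs by a factor of about 99 despite the unit-variance features.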

SNaveenMathew · November 25, 2024, 9:01pm

There are 2 points to consider:

1. Speed of convergence
2. Number of local minima

When you’re using a loss function (say binary cross entropy) and a simple model architecture (say logistic regression) where there’s only one distinct local minimum (which is also the global minimum) and no ‘plateau’, the advantages of using momentum are minimal - only #1 is relevant. However, when you use a more complex network structure and a loss function with multiple local minima and/or a ‘plateau’, momentum helps the model learn faster and may converge to a ‘better’ local optimum.
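The plateau effect can be sketched in one dimension. The function `f(w) = -exp(-w**2)` below is a hypothetical stand-in for a loss with a flat region, not anything from the course; starting at `w = 3.0` the gradient is tiny, so plain gradient descent barely moves, while the momentum velocity accumulates the small gradients that all point the same way. The values of `lr`, `beta` and the step count are assumptions chosen for illustration.

```python
import math


def grad(w):
    """Derivative of f(w) = -exp(-w**2), which is nearly flat far from w = 0."""
    return 2.0 * w * math.exp(-w * w)


def descend(w0, lr, beta, steps):
    """Heavy-ball momentum in 1D; beta = 0 reduces to plain gradient descent."""
    w, v = w0, 0.0
    for _ in range(steps):
        v = beta * v + grad(w)  # accumulate consistent small gradients
        w -= lr * v
    return w


START = 3.0  # assumed starting point, far out on the plateau

w_plain = descend(START, lr=0.1, beta=0.0, steps=30)
w_momentum = descend(START, lr=0.1, beta=0.9, steps=30)

moved_plain = START - w_plain
moved_momentum = START - w_momentum

print(f"distance covered by plain GD: {moved_plain:.5f}")
print(f"distance covered by momentum: {moved_momentum:.5f}")
```

In this setup both runs are still on the plateau after 30 steps, but the momentum run should have covered several times the distance toward the minimum at `w = 0`.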

rmwkwok · November 26, 2024, 2:01am

Hello @munish259272, I just wanted to add that this particular contour is a very special case, e.g. the one we get with a linear model. Such a contour, as you said, may be reshaped for the better with normalization. However, when we have a non-linear model, which is the case momentum is meant to deal with, the contour will, as you also said, become very complex.

Normalization does not “fix” the complexity brought by the model’s non-linearity. The contour on the slide serves as a simple example of how momentum works, but it probably does not show everything that momentum can overcome.

Cheers,
Raymond
