Can someone help explain mathematically why normalizing inputs could improve convergence speed in gradient descent?

Hi community,

I am new to deep learning and I am confused about why normalizing the input features (x1, x2, …, xn) to the same standard range and distribution helps the gradient descent process (which is basically computing dw1, dw2, …, db1, db2, …). A mathematical explanation would be very helpful!

Thanks!

Gradient descent is not guaranteed to work. One thing that can go wrong is that the shape of the cost surface is so steep or irregular that, to get convergence and avoid oscillation or divergence, you must choose such a small learning rate (the scaling factor applied to the gradient vectors to control the size of each “step”) that convergence becomes incredibly slow. When the input features live on very different scales, the cost surface is elongated: very steep along some directions and very flat along others. The learning rate has to be small enough for the steepest direction, even though most of the remaining distance to the minimum lies along the flat directions.
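
To put one version of the math behind that intuition (a minimal sketch, assuming a single linear unit trained with mean-squared error; this simplified setup is my choice, not something taken from the course):

For a linear model $\hat{y} = w^{\top}x + b$ with squared-error cost, $J$ is exactly quadratic in $w$:

$$
J(w) = \tfrac{1}{2}\,(w - w^{*})^{\top} H\,(w - w^{*}) + \text{const}, \qquad
H = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}\,x^{(i)\top}
$$

so the curvature (the Hessian $H$) is built directly from the inputs. Gradient descent with learning rate $\alpha$ stays stable only if $\alpha < 2/\lambda_{\max}(H)$, while the error along the flattest direction shrinks by roughly a factor of $(1 - \alpha\,\lambda_{\min}(H))$ per step, so the number of iterations you need grows with the condition number $\kappa = \lambda_{\max}/\lambda_{\min}$. Inputs on wildly different scales (say one feature in $[0, 1]$ and another in $[0, 255]$) make $\kappa$ enormous; normalizing the inputs pushes $H$ toward the identity, $\kappa$ toward $1$, and lets you safely use a much larger $\alpha$.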

A simple example is RGB image input: the raw pixel values range from 0 to 255, the corresponding slopes of the cost surface are very steep, and you have to use a very small learning rate. If you simply divide all the pixel values by 255, things work a lot better. Try this as an experiment with one of the image-based exercises in the courses here, e.g. DLS C1 W4 A2: see whether you can get it to converge with unnormalized versus normalized images.
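
If you want to see the effect numerically without opening the assignment, here is a minimal toy sketch (my own example, not code from DLS C1 W4 A2): linear regression by gradient descent on a single “pixel-like” feature, once on the raw 0-255 scale and once divided by 255.

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_descent(X, y, lr, steps=500):
    """Batch gradient descent for linear regression, cost J = 1/(2m) * ||Xw - y||^2."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)  # dJ/dw
        w -= lr * grad
    return np.mean((X @ w - y) ** 2) / 2   # final cost

m = 1000
x_raw = rng.uniform(0, 255, size=(m, 1))               # "pixel-like" feature on the 0-255 scale
y = 0.02 * x_raw[:, 0] + 1.0 + 0.1 * rng.standard_normal(m)

X_raw = np.hstack([x_raw, np.ones((m, 1))])            # raw feature + bias column
X_norm = np.hstack([x_raw / 255.0, np.ones((m, 1))])   # same feature rescaled to [0, 1]

# With the raw 0-255 feature, a learning rate much above ~1e-4 blows up, and a
# rate small enough to stay stable leaves the cost far from its minimum
# after 500 steps.
cost_raw = gradient_descent(X_raw, y, lr=5e-5)

# After dividing by 255 the cost surface is much better conditioned, so a
# "normal" learning rate converges essentially to the noise floor.
cost_norm = gradient_descent(X_norm, y, lr=0.5)

print(f"raw feature (lr=5e-5):        final cost {cost_raw:.4f}")
print(f"normalized feature (lr=0.5):  final cost {cost_norm:.4f}")
```

The only difference between the two runs is the scale of the input column, yet one of them is stuck with a tiny learning rate and a cost that barely moves, while the other converges comfortably.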

Note that you filed this under DLS Course 2, which deals with some more sophisticated versions of this issue, but you say you are new to deep learning. When you get to DLS C2 (if you aren’t there already), please watch the Week 2 lectures starting with this one.

Another source of information on the math behind finding solutions for neural networks is this paper from Yann LeCun’s group.
