Can someone help explain mathematically why normalizing inputs could improve convergence speed in gradient descent?

Hi community,

I am new to deep learning and I am confused about why normalizing the input features (x1, x2, …, xn) to the same standard range and distribution helps the gradient descent process (which is basically computing dw1, dw2, …, db1, db2, …). A mathematical explanation would be very helpful!

Thanks!

Gradient descent is not guaranteed to work. One thing that can go wrong is that the shape of the cost surface is so steep or irregular that, to get convergence and avoid oscillation or divergence, you must choose such a small learning rate (the scaling factor applied to the gradient vectors to control the size of each “step”) that convergence becomes incredibly slow. When the input features live on very different scales, the cost surface is elongated: very steep along some directions and very flat along others. The learning rate has to be small enough for the steepest direction, even though most of the remaining distance to the minimum lies along the flat directions.
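
To put one version of the math behind that intuition (a minimal sketch, assuming a single linear unit trained with mean-squared error; this simplified setup is my choice, not something taken from the course):

For a linear model $\hat{y} = w^{\top}x + b$ with squared-error cost, $J$ is exactly quadratic in $w$:

$$
J(w) = \tfrac{1}{2}\,(w - w^{*})^{\top} H\,(w - w^{*}) + \text{const}, \qquad
H = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}\,x^{(i)\top}
$$

so the curvature (the Hessian $H$) is built directly from the inputs. Gradient descent with learning rate $\alpha$ stays stable only if $\alpha < 2/\lambda_{\max}(H)$, while the error along the flattest direction shrinks by roughly a factor of $(1 - \alpha\,\lambda_{\min}(H))$ per step, so the number of iterations you need grows with the condition number $\kappa = \lambda_{\max}/\lambda_{\min}$. Inputs on wildly different scales (say one feature in $[0, 1]$ and another in $[0, 255]$) make $\kappa$ enormous; normalizing the inputs pushes $H$ toward the identity, $\kappa$ toward $1$, and lets you safely use a much larger $\alpha$.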

A simple example is RGB image input: the raw pixel values range from 0 to 255, the corresponding slopes of the cost surface are very steep, and you have to use a very small learning rate. If you simply divide all the pixel values by 255, things work a lot better. Try this as an experiment with one of the image-based exercises in the courses here, e.g. DLS C1 W4 A2: see whether you can get it to converge with unnormalized versus normalized images.
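
If you want to see the effect numerically without opening the assignment, here is a minimal toy sketch (my own example, not code from DLS C1 W4 A2): linear regression by gradient descent on a single “pixel-like” feature, once on the raw 0-255 scale and once divided by 255.

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_descent(X, y, lr, steps=500):
    """Batch gradient descent for linear regression, cost J = 1/(2m) * ||Xw - y||^2."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)  # dJ/dw
        w -= lr * grad
    return np.mean((X @ w - y) ** 2) / 2   # final cost

m = 1000
x_raw = rng.uniform(0, 255, size=(m, 1))               # "pixel-like" feature on the 0-255 scale
y = 0.02 * x_raw[:, 0] + 1.0 + 0.1 * rng.standard_normal(m)

X_raw = np.hstack([x_raw, np.ones((m, 1))])            # raw feature + bias column
X_norm = np.hstack([x_raw / 255.0, np.ones((m, 1))])   # same feature rescaled to [0, 1]

# With the raw 0-255 feature, a learning rate much above ~1e-4 blows up, and a
# rate small enough to stay stable leaves the cost far from its minimum
# after 500 steps.
cost_raw = gradient_descent(X_raw, y, lr=5e-5)

# After dividing by 255 the cost surface is much better conditioned, so a
# "normal" learning rate converges essentially to the noise floor.
cost_norm = gradient_descent(X_norm, y, lr=0.5)

print(f"raw feature (lr=5e-5):        final cost {cost_raw:.4f}")
print(f"normalized feature (lr=0.5):  final cost {cost_norm:.4f}")
```

The only difference between the two runs is the scale of the input column, yet one of them is stuck with a tiny learning rate and a cost that barely moves, while the other converges comfortably.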

Note that you filed this under DLS Course 2, which deals with some more sophisticated versions of this issue, but you say you are new to deep learning. When you get to DLS C2 (if you aren’t there already), please watch the Week 2 lectures starting with this one.

Another source of information on the math behind finding solutions for neural networks is this paper from Yann LeCun’s group.
