Why do we need to normalize data in the gradient descent algorithm?

Why do we need to normalize data in the gradient descent algorithm?

Considering the use of the mean and standard deviation:
why do we use the mean instead of the median?

Thank you in advance

We need to normalize data because it reduces oscillations during the updates and lowers the computational cost (values in the range 0 to 1 behave better than huge numbers), which helps and speeds up convergence to a minimum.

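The Gaussian distribution is parameterized by its mean, and in general the mean is different from the median. Subtracting the mean leaves each feature with an average of exactly zero, so the normalization is more balanced than it would be with the median.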


Also, normalizing the features allows us to use a larger learning rate without risk of the solution diverging due to excessively large gradients for individual features.
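To make the learning-rate point concrete, here is a minimal sketch in plain Python. The toy dataset, the linear model, and the helper names `standardize` and `gradient_descent` are all assumptions chosen for illustration, not something taken from the posts above. It runs batch gradient descent on mean squared error with one feature in the range 0 to 1 and another in the range 0 to 1000.

```python
import math

def standardize(columns):
    """Z-score each feature column: subtract its mean, divide by its std."""
    scaled = []
    for col in columns:
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        scaled.append([(v - mean) / std for v in col])
    return scaled

def gradient_descent(columns, y, lr, steps):
    """Batch gradient descent on MSE for a linear model with a bias term.

    Returns the final mean squared error (may be inf or nan on divergence).
    """
    n = len(y)
    # One row per sample, with a constant 1.0 appended for the bias weight.
    rows = [list(r) + [1.0] for r in zip(*columns)]
    w = [0.0] * len(rows[0])

    def residuals(weights):
        return [sum(wi * xi for wi, xi in zip(weights, row)) - yi
                for row, yi in zip(rows, y)]

    for _ in range(steps):
        errors = residuals(w)
        grad = [2.0 / n * sum(e * row[j] for e, row in zip(errors, rows))
                for j in range(len(w))]
        w = [wi - lr * gi for wi, gi in zip(w, grad)]

    return sum(e * e for e in residuals(w)) / n

# Hypothetical toy data: x1 spans 0..1, x2 spans 0..1000, no noise.
x1 = [a for a in (0.0, 0.5, 1.0) for _ in range(3)]
x2 = [b for _ in range(3) for b in (0.0, 500.0, 1000.0)]
y = [2.0 * a + 0.003 * b + 1.0 for a, b in zip(x1, x2)]

# Same learning rate, raw vs. standardized features.
loss_raw = gradient_descent([x1, x2], y, lr=0.1, steps=200)
loss_scaled = gradient_descent(standardize([x1, x2]), y, lr=0.1, steps=200)

# Raw features only stay stable with a tiny learning rate, which is slow.
loss_raw_small_lr = gradient_descent([x1, x2], y, lr=1e-6, steps=200)

print("raw, lr=0.1:   ", loss_raw)
print("raw, lr=1e-6:  ", loss_raw_small_lr)
print("scaled, lr=0.1:", loss_scaled)
```

On this toy problem the raw run with `lr=0.1` blows up, because the gradient along the large feature dominates every step. The raw run with `lr=1e-6` stays stable but has barely moved along the small feature after 200 steps, while the standardized run converges with the larger rate.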


In addition to @gent.spah's excellent response, you can also check that his statement is true by looking at the definition of the Gaussian (normal) distribution, whose parameters are the same standard deviation $\sigma$ and mean $\mu$ that the normalization step uses:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$


So the Gaussian probability density function is defined by the mean, not by the median. In a normal distribution this does not matter in the end, since the mean and median are identical due to the perfect symmetry. That said, the median and mean can of course differ in a data set that lacks this symmetry, that is, when the data does not follow a symmetric distribution such as the Gaussian or Student's t-distribution.
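As a quick check of the symmetry argument, the following sketch compares the mean and median on two small made-up samples, using Python's standard `statistics` module. The sample values and the helper names `center_summary` and `z_scores` are illustrative assumptions.

```python
import statistics

# Hypothetical samples: one symmetric around 3, one with a single large outlier.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
skewed = [1.0, 2.0, 3.0, 4.0, 100.0]

def center_summary(values):
    """Return (mean, median) of a sample."""
    return statistics.mean(values), statistics.median(values)

def z_scores(values):
    """Standardize with the mean and population standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

print("symmetric:", center_summary(symmetric))
print("skewed:   ", center_summary(skewed))

# After z-scoring, the mean is (numerically) zero even for the skewed sample,
# but the median is not: most points end up below zero.
scaled = z_scores(skewed)
print("skewed z-scores mean:  ", statistics.mean(scaled))
print("skewed z-scores median:", statistics.median(scaled))
```

For the symmetric sample the two statistics coincide. For the skewed one the outlier pulls the mean well away from the median, so centering on the mean and centering on the median give visibly different results.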

The other part of your question seems to be covered in these threads, too:

Hope that helps!

Best regards
