Impact of Feature Scaling on the Underlying Distribution

I’ve been trying to understand when to use min-max/mean normalization vs standardization (z-score normalization).

I understand that standardization is more resilient to outliers, since it doesn't pin the output to a fixed min and max the way min-max scaling does.

I have a couple of questions:

  1. Is it true that neither standardization nor normalization change the underlying shape of the distribution?
  2. Assuming (1) is true, why is it better to use standardization when the underlying data is Gaussian, given that neither technique changes the shape of the distribution?

Both methods do change the data. The range of the features is adjusted, and that makes gradient-based optimizers work better.

It certainly changes the data, but I don’t believe it changes the distribution: the curve looks the same. Changing the range simply ensures that a feature’s importance isn’t tied to its absolute scale (i.e. a range of 0 to 100 shouldn’t make it more important than one of 0.01 to 0.02).

My question is will the curve look different after normalization or standardization?
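To make the point about ranges concrete, here is a minimal sketch (plain NumPy, with made-up feature values) showing that min-max scaling maps two features with wildly different absolute ranges onto the same [0, 1] scale:

```python
import numpy as np

# Hypothetical features on very different absolute scales
a = np.array([0.0, 25.0, 50.0, 100.0])        # range 0 .. 100
b = np.array([0.01, 0.0125, 0.015, 0.02])     # range 0.01 .. 0.02

def min_max(x):
    """Min-max scaling: maps x linearly onto [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

# Both features land on exactly the same values: 0, 0.25, 0.5, 1
print(min_max(a))
print(min_max(b))
```

After scaling, neither feature dominates the other simply because of its units.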


I think we need to be a little more precise than to say “the curve looks the same”. If you mean (no pun intended) that if you start with a Gaussian Distribution and normalize it using mean normalization, then the resulting distribution is still Gaussian, then I think that is true. So the curve may have a similar overall shape, but a Gaussian with \mu = 0 and \sigma = 1 does not “look the same” as a Gaussian with \mu = 3 and \sigma = 5, right? If you graph them both on the same axes, the curves will be very different. Try it and “see”.


What curve are you referring to?


Sorry, yes I should be more precise in the wording.
While those distributions (the two Gaussian ones that @paulinpaloalto mentioned) do not look the same, we have only applied a linear transformation, so the underlying relationships within the data remain unchanged. This would be different if we applied a non-linear transformation, like log-normalization for example.
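A small sketch of that distinction (synthetic right-skewed data; the `skewness` helper is my own): a z-score transform preserves the shape of the distribution, while a log transform changes it.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # heavily right-skewed

def skewness(v):
    """Sample skewness: third standardized moment."""
    return np.mean(((v - v.mean()) / v.std()) ** 3)

z = (x - x.mean()) / x.std()  # linear transform: shape preserved
logged = np.log(x)            # non-linear transform: shape changes

print(skewness(x))       # large and positive (right-skewed)
print(skewness(z))       # same as above
print(skewness(logged))  # near zero: the log made it (nearly) Gaussian
```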

I’ve done some research, and what I fail to understand is the following statement: “Z-score normalization is useful when the data has a Gaussian distribution”. Why is z-score normalization “better” than min-max or mean normalization when the data follows a “normal” distribution?


Is that something Prof Ng says in the MLS lectures? Or did you find that in some website as you were doing your further research here?

Just as a general matter, my impression is that there isn’t any deep math going on here with getting gradient descent to work efficiently. You just need the values to be in a reasonable range so that the derivatives are well behaved. One simple case is RGB images: you’re starting with unsigned 8-bit values, so they range from 0 to 255. You can get pretty crazy gradients with values that large, so just the simplistic approach of dividing the values by 255 gives good behavior.
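The RGB case in code (a made-up random image standing in for real data):

```python
import numpy as np

rng = np.random.default_rng(2)
# Fake 64x64 RGB image: unsigned 8-bit values in 0..255
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Simple rescaling into [0, 1] keeps gradients in a reasonable range
x = img.astype(np.float32) / 255.0

print(x.min(), x.max())  # both within [0, 1]
```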


Sounds good. I think I may be overthinking this for the time being. That quote was from a different website, but I’ve found that lots of people say standardization is good for Gaussian data, and I wanted to know why.
