Impact of Feature Scaling on the Underlying Distribution

I’ve been trying to understand when to use min-max/mean normalization vs standardization (z-score normalization).

I understand that standardization is more resilient to outliers, since it doesn't pin the output to a fixed min and max the way min-max scaling does.

I have a couple of questions:

  1. Is it true that neither standardization nor normalization change the underlying shape of the distribution?
  2. Assuming (1) is true, why is it better to use standardization when the underlying data is Gaussian, given that neither technique changes the shape of the distribution?

Both methods do change the data. The range of the features is adjusted, and that makes gradient-based optimizers work better.

It certainly changes the data, but I don’t believe it changes the distribution: the curve looks the same. Changing the range simply ensures that a feature’s importance isn’t tied to its absolute scale (i.e. a range of 0 to 100 shouldn’t make it more important than one of 0.01 to 0.02).

My question is will the curve look different after normalization or standardization?
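To make the point about ranges concrete, here is a minimal sketch (plain NumPy, with made-up feature values) showing that min-max scaling maps two features with wildly different absolute ranges onto the same [0, 1] scale:

```python
import numpy as np

# Hypothetical features on very different absolute scales
a = np.array([0.0, 25.0, 50.0, 100.0])        # range 0 .. 100
b = np.array([0.01, 0.0125, 0.015, 0.02])     # range 0.01 .. 0.02

def min_max(x):
    """Min-max scaling: maps x linearly onto [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

# Both features land on exactly the same values: 0, 0.25, 0.5, 1
print(min_max(a))
print(min_max(b))
```

After scaling, neither feature dominates the other simply because of its units.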


I think we need to be a little more precise than to say “the curve looks the same”. If you mean (no pun intended) that if you start with a Gaussian Distribution and normalize it using mean normalization, then the resulting distribution is still Gaussian, then I think that is true. So the curve may have a similar overall shape, but a Gaussian with \mu = 0 and \sigma = 1 does not “look the same” as a Gaussian with \mu = 3 and \sigma = 5, right? If you graph them both on the same axes, the curves will be very different. Try it and “see”.


What curve are you referring to?


Sorry, yes I should be more precise in the wording.
While those distributions (the two Gaussian ones that @paulinpaloalto mentioned) do not look the same, we have only applied a linear transformation, so the underlying relationships within the data remain unchanged. This would be different if we applied a non-linear transformation, like log-normalization for example.
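A small sketch of that distinction (synthetic right-skewed data; the `skewness` helper is my own): a z-score transform preserves the shape of the distribution, while a log transform changes it.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # heavily right-skewed

def skewness(v):
    """Sample skewness: third standardized moment."""
    return np.mean(((v - v.mean()) / v.std()) ** 3)

z = (x - x.mean()) / x.std()  # linear transform: shape preserved
logged = np.log(x)            # non-linear transform: shape changes

print(skewness(x))       # large and positive (right-skewed)
print(skewness(z))       # same as above
print(skewness(logged))  # near zero: the log made it (nearly) Gaussian
```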

I’ve done some research, and what I fail to understand is the following statement: “Z-score normalization is useful when the data has a Gaussian distribution”. Why is z-score normalization “better” than min-max or mean normalization when the data follows a “normal” distribution?


Is that something Prof Ng says in the MLS lectures? Or did you find that in some website as you were doing your further research here?

Just as a general matter, my impression is that there isn’t any deep math going on here with getting gradient descent to work efficiently. You just need the values to be in a reasonable range so that the derivatives are well behaved. One simple case is RGB images: you’re starting with unsigned 8-bit values, so they range from 0 to 255. You can get pretty crazy gradients with values that large, so just the simplistic approach of dividing the values by 255 gives good behavior.
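The RGB case in code (a made-up random image standing in for real data):

```python
import numpy as np

rng = np.random.default_rng(2)
# Fake 64x64 RGB image: unsigned 8-bit values in 0..255
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Simple rescaling into [0, 1] keeps gradients in a reasonable range
x = img.astype(np.float32) / 255.0

print(x.min(), x.max())  # both within [0, 1]
```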


Sounds good. I think I may be overthinking this for the time being. That quote was from a different website, but I’ve found that lots of people say standardization is good for Gaussian data, and I wanted to know why.
