I’m wondering if there are times when scaling the features isn’t appropriate, particularly when we want to use PCA to reduce the number of features. My reasoning: if we normalise the scale, then all features end up with the same spread, and PCA has to treat them all as equally variable.
For example, let’s take the Car length + Wheel size example given in the lectures. In that example, there’s a pretty linear relationship between car length and wheel size. But imagine instead that the two features are independent, and that wheel size varies only on a small scale whereas car length varies a lot. What if we now want to reduce to the features that best represent the variation across different types of cars?
If we pre-scale our features, then car length will be squashed onto the same scale as wheel size (say, both in the range -1 to 1). In contrast, if we leave the features unscaled, then PCA will identify car length as the 1st principal component, and wheel size as a secondary perpendicular axis.
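To make this concrete, here’s a minimal sketch of what I mean, assuming scikit-learn and synthetic numbers of my own choosing (the means and spreads below are illustrative, not from the lectures):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
car_length = rng.normal(450, 80, n)  # cm: large spread
wheel_size = rng.normal(40, 2, n)    # cm: small spread, independent of length
X = np.column_stack([car_length, wheel_size])

# Unscaled: PC1 loads almost entirely on car length.
pca_raw = PCA(n_components=2).fit(X)
print("unscaled variance ratios:", pca_raw.explained_variance_ratio_)
print("unscaled PC1 loadings:   ", pca_raw.components_[0])

# Scaled: both features now have unit variance, so the variance is
# split roughly 50/50 and PC1 no longer singles out car length.
X_std = StandardScaler().fit_transform(X)
pca_std = PCA(n_components=2).fit(X_std)
print("scaled variance ratios:", pca_std.explained_variance_ratio_)
print("scaled PC1 loadings:   ", pca_std.components_[0])
```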
So, first question, is this reasoning valid? Does this mean that there are times when scaling is counter-productive?
Second question: if so, what would the general rule be for when we want to scale and when we don’t? It seems that scaling would generally be counterproductive if the features are already independent and we’re using PCA to find the “most significant” features. Would that make sense?
Scaling features ensures that each feature contributes equally. However, if some features genuinely vary more than others (like car length versus wheel size), scaling obscures those inherent differences in variance. In such cases, PCA may fail to identify the most informative components, because it treats all features as equally variable.
It is good to scale features when they are measured on different scales or in different units and you want each feature to contribute equally to the PCA; this is common when features have different magnitudes but are considered equally important for the analysis. Conversely, do not scale when the features’ variances are naturally indicative of their importance and you want PCA to reflect this: if the features are already independent and their raw spreads carry meaning, leaving them unscaled lets PCA prioritize the features with larger variances.
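If you do decide scaling is appropriate, one common pattern (assuming scikit-learn; the array names and numbers below are placeholders) is to chain the scaler and PCA in a Pipeline, so the scaling fitted on the training data is reused when transforming new data:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X_train = rng.random((100, 2)) * [400, 4]  # hypothetical car data
X_new = rng.random((10, 2)) * [400, 4]

# Features considered equally important: scale first, then reduce.
scaled_pca = Pipeline([("scale", StandardScaler()),
                       ("pca", PCA(n_components=1))]).fit(X_train)

# Raw variance is itself the signal of importance: skip the scaler.
raw_pca = PCA(n_components=1).fit(X_train)

print(scaled_pca.transform(X_new).shape)  # (10, 1)
print(raw_pca.transform(X_new).shape)     # (10, 1)
```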
You can also check this website that talks about how to handle different feature scales: Click Here
TL;DR: You can use:
- Min-max normalization, which rescales values so that they fall between 0 and 1.
- Standardization, which rescales values to a mean of 0 and a standard deviation of 1.
In general, standardization is the more common choice and tends to work better when your values follow a normal distribution (i.e., look like a bell curve); min-max normalization can be more effective when your data are not normally distributed.
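Here’s a quick illustration of the two rescalings, assuming scikit-learn (the tiny sample array is made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [10.0]])

# Min-max: x' = (x - min) / (max - min), so values land in [0, 1].
print(MinMaxScaler().fit_transform(x).ravel())
# -> [0.    0.111 0.222 1.   ] (approximately)

# Standardization: z = (x - mean) / std, so mean 0 and unit variance.
print(StandardScaler().fit_transform(x).ravel())
```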