I thought the purpose of scaling/normalizing was to speed up and improve convergence of the optimization (e.g. gradient descent) during a learning algorithm’s training…
but the following paragraph in the lab (cell #25) makes me wonder if I’ve missed something else more profound about its purpose/effect… but I’m really not sure:
Since you are dealing exclusively with geospatial data you will create some transformations that are aware of this geospatial nature. This helps the model make a better representation of the problem at hand.
For instance the model cannot magically understand what a coordinate is supposed to represent and since the data is taken from New York only, the latitude and longitude revolve around (37, 45) and (-70, -78) respectively, which is arbitrary for the model. A good first step is to scale these values.
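(For context, the “scaling” the lab refers to is essentially something like the following; this is a rough sketch with made-up values, not the lab’s actual code:)

```python
import numpy as np

# Made-up pickup coordinates, roughly in the ranges the lab mentions.
latitude = np.array([40.71, 40.75, 40.69, 40.80])
longitude = np.array([-74.00, -73.98, -74.02, -73.95])

def standardize(x):
    """z-score: zero mean, unit variance."""
    return (x - x.mean()) / x.std()

print(standardize(latitude))   # centered around 0 instead of ~40
print(standardize(longitude))  # centered around 0 instead of ~-74
```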
So, is there some type of feature representation learning or some type of model interpretability or something else that feature scaling/normalization is performing… ?
Unconfuse me please.
Hi @shahin
really interesting question.
To “unconfuse” I would say that:
- In general, scaling the features helps the model converge. It also avoids having features on very different scales: in some models (think linear ones), if one feature has a much bigger scale than the others, it is difficult to see the effect of variation in the other features (see the sketch at the end of this reply).
- Having said that, we could enter the field of “feature engineering”: if we can transform a feature, or some features, in such a way that their information content is exposed more directly, the model will probably make better use of it.
But, to be honest, in general, scaling is about the first point.
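To make the first point concrete, here is a tiny sketch on toy data (scikit-learn assumed): with the raw features the fitted coefficients live on completely different scales, so their magnitudes say nothing about relative importance; after standardizing, they become directly comparable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two toy features on very different scales.
x_small = rng.uniform(0, 1, size=500)        # e.g. a ratio
x_big = rng.uniform(0, 10_000, size=500)     # e.g. a distance in metres
y = 3.0 * x_small + 0.001 * x_big + rng.normal(0, 0.1, size=500)

X = np.column_stack([x_small, x_big])

# Raw features: coefficients reflect the units, not the importance.
print(LinearRegression().fit(X, y).coef_)      # roughly [3.0, 0.001]

# Standardized features: coefficients are now directly comparable.
X_std = StandardScaler().fit_transform(X)
print(LinearRegression().fit(X_std, y).coef_)  # both of the same order of magnitude
```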
Thanks @luigisaetta ,
I would venture then to say that extracting/deriving more impactful numbers from the raw numbers, numbers that are more pertinent to the label, aka “feature engineering” or more specifically “feature extraction” (i.e. feeding distances into the learning algorithm rather than individual pairs of coordinates, as in the sketch below), just gives the learning algorithm a helping hand. If I’m not mistaken, machine learning is limited to learning a function using only multiplication and summation operations; no subtraction (let alone Euclidean distance calculations) is included.
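Something like this is what I have in mind by “feeding distances rather than pairs of coordinates” (a rough sketch with hypothetical column names, not the lab’s code; for real geographic distance one would use something like the haversine formula):

```python
import numpy as np

# Hypothetical pickup/dropoff coordinates.
pickup_lat  = np.array([40.71, 40.75])
pickup_lon  = np.array([-74.00, -73.98])
dropoff_lat = np.array([40.80, 40.69])
dropoff_lon = np.array([-73.95, -74.02])

# Euclidean distance in coordinate space: a single derived feature that
# makes explicit the "trip length" information the raw pairs only hold implicitly.
trip_distance = np.sqrt(
    (dropoff_lat - pickup_lat) ** 2 + (dropoff_lon - pickup_lon) ** 2
)
print(trip_distance)
```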
Scaling, as you confirmed, is purely an aid to the optimization algorithm. It was discussed in Course 2, but it seems to be conflated, both there and here, with “feature engineering”. It’s getting close to a semantic/philosophical point, but I think it is potentially confusing. If I may suggest, it might be better to end that paragraph at “…arbitrary for the model”, and then, in a clearly separated section, replace “A good first step is to scale these values” with something like “A powerful and simple preprocessing step that benefits the optimization algorithms used by certain machine learning algorithms is to scale the numerical features, as described in week 2 of Course 2.”