For the feature engineering and polynomial regression lab, why do larger parameters signify higher importance? Maybe this only applies once we’ve rescaled all the features?

Hello @Alexander_Leon, I think you are talking about the following:

I don't think that description is meant to discuss feature importance.

The example says the label is x squared (because of this line of code: `y = x**2`), and among the provided features (`X = np.c_[x, x**2, x**3]`) we would only have needed the second one, which is also x squared, and could have dropped the others.

Even though all three were provided, the training algorithm was still able to suppress the `x` term and the `x**3` term to some degree by making their weights close to zero. However, *that is not good enough, because we already know both should be exactly 0*. By scaling the features, the algorithm can push them even closer to zero, which is the improvement.

I think we are not discussing feature importance here, but rather how scaling improves the result of gradient descent.
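To make this concrete, here is a rough sketch that reuses `y = x**2` and `X = np.c_[x, x**2, x**3]` and runs plain batch gradient descent with and without z-score scaling. The `gradient_descent` helper, the range of `x`, the learning rates and the iteration counts are my own assumptions for illustration and may not match the lab's code exactly.

```python
import numpy as np

def gradient_descent(X, y, alpha, iters):
    """Plain batch gradient descent for linear regression (hypothetical helper)."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(iters):
        err = X @ w + b - y              # prediction error for every example
        w -= alpha * (X.T @ err) / m     # gradient step for the weights
        b -= alpha * err.mean()          # gradient step for the bias
    return w, b

# Data as in the example: the label is exactly x squared.
x = np.arange(0, 20, 1, dtype=float)     # assumed input range
y = x**2
X = np.c_[x, x**2, x**3]

# Unscaled features: a tiny learning rate is needed because x**3 is huge.
w_raw, b_raw = gradient_descent(X, y, alpha=1e-7, iters=10_000)

# Z-score scaled features: every column has mean 0 and standard deviation 1.
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_norm = (X - mu) / sigma
w_norm, b_norm = gradient_descent(X_norm, y, alpha=1e-1, iters=100_000)

print("unscaled weights:", w_raw, "bias:", b_raw)
print("scaled weights:  ", w_norm, "bias:", b_norm)
```

With the scaled features, the weights on the `x` and `x**3` columns should come out tiny compared with the weight on the `x**2` column. Note that the two weight vectors live on different scales, so only the scaled ones can be compared column to column.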

However, I do have some personal opinions about measuring feature importance by the size of linear regression parameters/weights.

Firstly, yes, I agree that scaling the features is the *least* we should do to make the parameters comparable.

However, the *ideal scenario* is when the features are uncorrelated with each other.

For example, if you have three extremely highly correlated features that are all very good predictors of the label, they compete with each other for larger weight values. Compared to using only one of the three in the feature set, using all three can leave each of them with about 1/3 of the weight. That weight may then be even smaller than the weight of another feature that is actually less relevant, which makes it a misleading measurement.

As a result, the parameter size alone does not reflect only feature importance; it is also coupled with how strongly the feature correlates with the other features.
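Here is a small sketch of that weight-splitting effect, taken to the extreme where the three correlated features are exact copies of each other. Everything in it is made up for illustration: the synthetic features `z` and `g`, the true coefficients 3.0 and 1.5, and the use of `np.linalg.lstsq`, which returns the minimum-norm solution when the columns are linearly dependent (gradient descent started from zero weights heads toward the same solution).

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200

# Two synthetic features on the same scale (hypothetical data).
z = rng.standard_normal(m)   # strong predictor
g = rng.standard_normal(m)   # weaker predictor

# By construction, z matters twice as much as g.
y = 3.0 * z + 1.5 * g

# Case 1: z appears once in the feature set.
X_single = np.c_[z, g]
w_single, *_ = np.linalg.lstsq(X_single, y, rcond=None)

# Case 2: z appears three times (perfectly correlated copies).
X_triple = np.c_[z, z, z, g]
w_triple, *_ = np.linalg.lstsq(X_triple, y, rcond=None)

print("z once:        ", w_single)
print("z three times: ", w_triple)
```

In the second case the total weight of 3.0 is shared across the three copies, so each copy ends up with a smaller weight than `g`, even though `g` is the less relevant feature. With features that are highly but not perfectly correlated, the split is usually less tidy and depends on the fitting method, but the same caution applies.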

The simplest concept is that the magnitude of the weight values tells you how much impact that feature has on the hypothesis.

The larger the magnitude of the weight value (either positive or negative), the more it influences the hypothesis.

A weight value of zero means that feature has no useful impact on the hypothesis, because the weight is multiplied by the feature value.