"Deciding what to try next" features


In the image, how does experimenting with a smaller or additional set of features affect bias and variance? If I consider various features, such as x1, x2, …, xn, and my model is structured as w1x1 + w2x2 + … + wnxn + b, the resulting plot would remain linear irrespective of the number of features chosen, correct? Does the author imply that altering the feature set refers to the inclusion or exclusion of polynomial features? If that’s the case, how do the second and third techniques differ from the fourth as depicted in the image?
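For example, here is what I mean (a tiny sketch with made-up numbers, not from the course): the model stays linear in the weights either way, but the plot only curves once a polynomial feature is added.

```python
import numpy as np

x = np.linspace(0, 5, 6)
w1, w2, b = 1.0, 0.5, 2.0

# Linear in the weights AND in x: the plot is a straight line,
# no matter how many plain features x1, x2, ..., xn we add.
y_linear = w1 * x + b

# Still linear in the weights (w1, w2, b), but the plot versus x
# is a curve, because x^2 is a polynomial feature.
y_poly = w1 * x + w2 * x**2 + b

print(y_linear)  # evenly spaced values, a straight line
print(y_poly)    # increasing gaps, a curve
```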
@rmwkwok

Hello @abhilash341,

Thanks for sharing your thought process!

It appears to me that, to you, increasing “non-linearity” increases variance and reduces bias, and decreasing “non-linearity” does the opposite.

This is not wrong, but it is not the full story.

I would say that, generally, adding more meaningful tunable parameters increases variance. Suppose you go from a linear regression model with the features “Age” and “Weight” to one with the features “Age”, “Weight”, and “Height”: what change in bias and variance would you expect? If you further add “Heart rate”, what then? If you further add “Age^2”, what then?
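If you would like to experiment with this yourself, here is a minimal sketch (the synthetic data, the feature values, and the use of scikit-learn are my own illustration, not from the lecture) that fits a linear regression on a growing feature set and prints the training and CV costs:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 60

# Made-up data: the label truly depends only on Age and Weight, plus noise.
age = rng.uniform(20, 70, n)
weight = rng.uniform(50, 100, n)
height = rng.uniform(150, 200, n)
heart_rate = rng.uniform(55, 95, n)
y = 0.5 * age + 0.3 * weight + rng.normal(0, 5, n)

feature_sets = {
    "Age, Weight": np.column_stack([age, weight]),
    "+ Height": np.column_stack([age, weight, height]),
    "+ Heart rate": np.column_stack([age, weight, height, heart_rate]),
    "+ Age^2": np.column_stack([age, weight, height, heart_rate, age**2]),
}

for name, X in feature_sets.items():
    X_tr, X_cv, y_tr, y_cv = train_test_split(X, y, test_size=0.5, random_state=1)
    model = LinearRegression().fit(X_tr, y_tr)
    print(f"{name:15s} train MSE = {mean_squared_error(y_tr, model.predict(X_tr)):.2f}, "
          f"CV MSE = {mean_squared_error(y_cv, model.predict(X_cv)):.2f}")
```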

What do you think?

Raymond

Thanks for the answer. Regarding your question:

If you further add “Heart rate”, what then? If you further add “Age^2”, what then?

If you further add “Heart rate”, it is still a linear regression, so it should not cause high variance; if you further add “Age^2”, then it should introduce some variance.

Hello @abhilash341,

From this moment on, we are not comparing “linear vs. non-linear”.

We are only comparing “linear vs. linear”. Consider two models, both predicting the same label but using different numbers of features.

  • Model A: 3 features. The model therefore has 3 weights and 1 bias.

  • Model B: 300 features. The model therefore has 300 weights and 1 bias.

Now they are both linear models, and both predict the same label. The only difference is that Model A uses only 3 of the features available to Model B. Note that none of the 300 features is a polynomial feature.

Which model do you think is more likely to have higher variance?

Remember that one take-away about high variance is that the model has a higher cost on the CV set than on the training set, and what machine learning does is fit the model to the training set in the best way possible.
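If you want to check this empirically before answering, here is a small sketch (the data is synthetic and purely illustrative, not from the course) that fits both models and prints their training and CV costs; run it and compare the two gaps:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n_train, n_cv = 400, 400

# 300 candidate features; the label truly depends only on the first 3.
X_train = rng.normal(size=(n_train, 300))
X_cv = rng.normal(size=(n_cv, 300))
true_w = np.zeros(300)
true_w[:3] = [2.0, -1.0, 0.5]
y_train = X_train @ true_w + rng.normal(0, 1, n_train)
y_cv = X_cv @ true_w + rng.normal(0, 1, n_cv)

for name, k in [("Model A (3 features)", 3), ("Model B (300 features)", 300)]:
    # Each model sees only its first k features; both are plain linear models.
    model = LinearRegression().fit(X_train[:, :k], y_train)
    mse_tr = mean_squared_error(y_train, model.predict(X_train[:, :k]))
    mse_cv = mean_squared_error(y_cv, model.predict(X_cv[:, :k]))
    print(f"{name}: train MSE = {mse_tr:.3f}, CV MSE = {mse_cv:.3f}")
```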

Cheers,
Raymond

Thanks for the response, Raymond. I think Model B with 300 features should capture the noise and have higher variance compared with Model A with 3 features.

Yes, @abhilash341, you have made a very good point, and I would like to congratulate you on figuring this out! This comes entirely from you :wink: :wink: :wink:

So, no matter whether the features are polynomial or not, with more features there is a higher chance that our model ends up over-fitting the training data. Equivalently, with more features there are more trainable weights in the model, and a higher chance that we can tune those weights in a way that over-fits the training data.

A polynomial feature makes it easy to see how the model may overfit, but that is not the most decisive factor, because, just as you said, even a non-polynomial feature can cause problems.

Cheers!
Raymond
