W2 - use all features or add them one by one?

Regarding the iterative process, how do people usually start training models when there are a lot of features in the dataset? Should I add them gradually one by one to have more control over what works and what does not, or is it better to add all the features at once and then use bias/variance/error as guidance for the next step?

What do you mean by adding the features gradually one by one? Training the model with one set of features and then training it with another set?

No, not exactly. I mean starting with some subset of the available features to train a model, then adding another set of features to the first set, training the model again, and checking whether that second subset improved the dev/test error.
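
For concreteness, here is a minimal sketch of the kind of comparison I mean (purely synthetic data and an arbitrary model/feature subset, just to illustrate the process):

```python
# Compare dev-set accuracy when training on a feature subset vs. all features.
# Data, model choice, and column indices are made up for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                    # 10 synthetic features
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)      # label depends on features 0 and 3

X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.2, random_state=0)

subset = [0, 1, 2]                                 # first subset of features
model_a = LogisticRegression().fit(X_train[:, subset], y_train)
print("dev accuracy, subset only:", model_a.score(X_dev[:, subset], y_dev))

model_b = LogisticRegression().fit(X_train, y_train)  # all features
print("dev accuracy, all features:", model_b.score(X_dev, y_dev))
```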

Use all the features right from the start.

What if I could get acceptable accuracy with only a limited set of features? Shouldn’t I strive for the minimum number of features to build the model? I guess the fewer features I have, the less of a burden it will be to support the ML solution in the long run.

It’s a theoretical question from someone who does not have much experience in ML. Sorry if it does not make sense.

Hey! Do you know what a feature actually is? Our output depends on the features, so if we select only a few and skip some (possibly important) ones, our model will perform poorly.

For example, let’s say that to survive we need oxygen, water, and food. The rest (shelter, clothes, etc.) are minor. These three are the features, and if we skip any one of them, what will the output be?

Can you relate?
A feature is something the output depends on.

Best,
Saif.

No, not usually. If you have the features, you might as well use them.

With modern computers, only if you have a stupendously huge number of features (thousands) would you need to worry about having too many.

Since we’re doing “machine learning”, you usually should not rely on making your own judgement calls about which features are useful.

There are some methods where using fewer features is an advantage. Typically this will be when you have a huge dataset and you need to retrain the model very often. Then you might use an unsupervised (i.e. statistical) method to remove features that don’t show very much variance. Those features probably aren’t doing very much to improve the model.
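
For example, here is a minimal sketch of such a variance-based filter using scikit-learn’s VarianceThreshold (the synthetic data and the threshold value are placeholders, not a recommendation):

```python
# Unsupervised (statistical) filter: drop features whose variance falls
# below a chosen threshold. Data and threshold are assumptions for the sketch.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
X[:, 5] = 0.001 * rng.normal(size=500)   # a nearly constant (low-variance) feature

selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)

print("features kept:", selector.get_support(indices=True))
print("shape before/after:", X.shape, X_reduced.shape)
```

Note that the variance depends on feature scale, so in practice you would standardize the features or tune the threshold before relying on this kind of filter.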

Exactly. The training based on the cost function (the “machine learning”) will be able to figure out which features matter and which don’t, so more information is better. The only thing it costs you is memory space and CPU cycles during training. But those are cheap compared to the brain power and engineering time it takes to make all the decisions that you are talking about.

Also note that training a big model with realistically sized datasets can be time consuming (sometimes days or weeks), so you don’t want to do that multiple times if you can avoid it.

There are more sophisticated techniques that Tom mentioned that you can use after the fact to go back and figure out which parameters didn’t matter that much. But if you’ve already got a working model at that point, you’d only invest that effort to downsize it if the compute costs of running the model in inference (“prediction” as opposed to training) mode are a problem. That might be true if (for example) the purpose of the model is to be run on a small device like a cell phone.

In addition to @paulinpaloalto’s great summary:

  • feature importance helps you assess and select your features; see also this repo:

    In this repo you can also find the unsupervised methods (like Principal Component Analysis) mentioned by @TMosh, plus a great supervised method, the Partial Least Squares transformation, to keep a good „data to feature ratio“ through dimensionality reduction, e.g. by getting rid of redundant information in your features. If you are interested in what is meant by this data/feature ratio, feel free to check out this thread: C3_W2 - PCA Question - #3 by Christian_Simonis. A minimal code sketch of feature importance and PCA follows after this list.

  • in practice there is often a trade-off: adding one or two features (hopefully improving model performance) also brings higher maintenance and monitoring effort to ensure the quality of your ML pipeline and your data, which usually means more effort and higher total cost when operating your ML system

  • one additional point: in deep learning you do not necessarily need to check feature importance when working with big and highly unstructured data, since DL takes care of feature engineering on its own. But in classic machine learning it is definitely a good idea to check feature importance after handcrafting your features with domain knowledge; this is considered a best practice by many practitioners in industry, who usually work with structured, limited data and leverage classic machine learning models!
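
As a rough illustration of the first bullet (not the content of the repo mentioned above), here is a minimal scikit-learn sketch of tree-based feature importance and PCA on synthetic data; the data, model, and variance cutoff are assumptions:

```python
# Two complementary checks: supervised feature importance (random forest)
# and unsupervised dimensionality reduction (PCA). Synthetic data only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
y = 3 * X[:, 0] - 2 * X[:, 4] + 0.1 * rng.normal(size=400)

# Feature importance: which handcrafted features does the model rely on?
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
for i, imp in enumerate(forest.feature_importances_):
    print(f"feature {i}: importance {imp:.3f}")

# PCA: compress redundant information into fewer components.
pca = PCA(n_components=0.95)             # keep 95% of the variance
X_pca = pca.fit_transform(X)
print("principal components kept:", pca.n_components_)
```

In this toy setup the importances should concentrate on features 0 and 4, which mirrors the idea of checking which handcrafted features the model actually relies on.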

Hope that helps!

Best regards
Christian

I believe this is a good point @vm.mishchenko! It is worth discussing a bit: if you can solve your business problem with a limited feature set, are confident about data availability & quality over the model lifecycle, and have a robust model that satisfies your business requirements: go for it. Since you have a small feature set, it will probably also be quite efficient in operations.

Now the question is: if you improved the model performance and got better results, would your users be willing to pay more for that better solution? If so, it might be worth going that way and accepting the higher maintenance effort and probably higher total cost of ownership in order to offer a better service on the market with improved performance. Please note that this is usually the business view from product management. As a developer, of course, you want to understand what the best model and the optimal technical solution would be anyway!

What I want to make clear is that there is a big difference between training a nice model in a Jupyter notebook offline and delivering a scalable SaaS service, e.g. in an MLOps scheme.

Therefore, it is really important to understand that the decision on the best number of features in an ML model should take into account:

  • technical aspects AND
  • business requirements!

(see also the CRISP-DM model).

Please let me know if this answers your question!

Best regards
Christian

trade-off: adding one or two features also brings higher maintenance and monitoring effort to ensure the quality of your ML pipeline

That’s exactly what I was trying to understand initially. Thanks, all, for unpacking and explaining everything in such detail! There are a lot of nuances that I hadn’t thought about.
