Model performance comparison in Feature selection


First of all, thanks a lot for creating the course; it contains a lot of useful information.
I have some questions about the model performance comparison table, for example the one in the “Embedded Methods” video at the 4:13 mark:

How does the RandomForest model produce better results when it is exposed to less information, assuming that the hyperparameters are the same?

My other question is a bit longer: I assume that the scores presented there have a degree of randomness and should have a statistical uncertainty associated with them. For example, if we decrease the dataset size, the values of the scores will change a bit and their corresponding statistical uncertainties will increase. Is my understanding correct?

If this is true, then I would assume that at least some of the presented improvement comes from statistical fluctuation.

Hi @Sviat

The first question is not entirely clear (at least to me). Could you add more details?

Regarding the second question, I assume that you are asking in general. ML results are always probabilistic in nature. Normally, the bigger the dataset, the better, but it is not always that simple. It will become clearer as you progress through the specialization.
Let me give an example: very often you can achieve better results by increasing the dataset size. That is the core idea of a “Data-Centric approach”. But when you add more data, you may end up merging into the old dataset new data that was labeled by different people using a different approach. For these reasons, adding data may not give all the improvement you expect.
Doing an ML project well takes a lot of effort. You need to develop a well-engineered pipeline.
In addition, to estimate the uncertainty of the predictions, the only useful approach I know is to develop different models (for example, by training on independent datasets) and compute the variance of the performance metrics (for example, accuracy).
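
To make that last point concrete, here is a minimal sketch in plain Python. It is not the setup used in the course video: the synthetic two-class data, the toy nearest-centroid classifier, and the helper names (`make_dataset`, `fit_centroids`, `predict`, `accuracy_spread`) are all assumptions made for illustration. The idea is only to train the same kind of model on several independently drawn datasets and look at the mean and standard deviation of the accuracy.

```python
import random
import statistics


def make_dataset(n, rng):
    """Draw n labeled 2-D points (hypothetical data): class 0 is centered
    at (0, 0), class 1 at (2, 2), both with unit Gaussian noise."""
    data = []
    for i in range(n):
        label = i % 2  # alternate labels so both classes are always present
        center = 2.0 * label
        point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((point, label))
    return data


def fit_centroids(train):
    """Toy stand-in for a real model: store the mean point of each class."""
    centroids = {}
    for label in (0, 1):
        pts = [x for x, y in train if y == label]
        centroids[label] = (
            sum(p[0] for p in pts) / len(pts),
            sum(p[1] for p in pts) / len(pts),
        )
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    def sq_dist(c):
        return (x[0] - c[0]) ** 2 + (x[1] - c[1]) ** 2
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy_spread(n_samples, n_runs=30, seed=0):
    """Train and evaluate on n_runs independent datasets of n_samples points
    (half for training, half for testing). Returns (mean, std) of accuracy."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        train = make_dataset(n_samples // 2, rng)
        test = make_dataset(n_samples // 2, rng)
        model = fit_centroids(train)
        correct = sum(predict(model, x) == y for x, y in test)
        scores.append(correct / len(test))
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    for n in (40, 4000):
        mean, std = accuracy_spread(n)
        print(f"n={n}: accuracy = {mean:.3f} +/- {std:.3f}")
```

With a real project you would swap the toy classifier for your own model and the synthetic data for independent splits or resamples of your dataset. If the difference between two rows of a comparison table is of the same order as the standard deviation you measure this way, it is hard to say that the difference is a real improvement.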