RF sampling with replacement: duplicate rows

If we sample with replacement one row at a time, we allow duplicate rows into the training set of each tree. Doesn’t that bias the results?
I would expect (also based on the Random Forest implementation in sklearn) that each tree would get a subset of the whole training set without duplicate rows.
Anyone care to clarify?

Hey @gkouro,
Having duplicate data can certainly bias the results, but several factors work against this, so in practice it rarely matters much.

  • For starters, as the training dataset grows and the number of samples drawn for each decision tree shrinks relative to it, picking the same example more than once becomes less and less common even when sampling with replacement (see the simulation sketch after this list).
  • Similarly, since a random forest aggregates many decision trees, even if some individual trees are slightly biased towards a small number of examples, averaging their predictions tends to wash that bias out.

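Here is a minimal simulation of the first point, with made-up numbers: drawing rows with replacement via `numpy.random.choice` and counting how many of the draws are distinct. The dataset size and sample sizes below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

n_rows = 10_000                            # size of the full training set (made-up)
for n_samples in (10_000, 1_000, 100):     # rows drawn (with replacement) per tree
    draws = rng.choice(n_rows, size=n_samples, replace=True)
    unique_frac = np.unique(draws).size / n_samples
    print(f"drew {n_samples:>6} of {n_rows} rows: "
          f"{unique_frac:.1%} of the draws are distinct")
```

With a full-size bootstrap (10,000 of 10,000) roughly 63% of the draws are distinct rows, but as the per-tree sample becomes a smaller fraction of the dataset, nearly every draw is a distinct row.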
However, if you still feel otherwise, implementing a random forest isn’t that difficult if you use the decision tree implementation from scikit-learn. Try implementing it both ways and see whether you find any difference in the results, and do share your findings with the community.
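As a rough sketch of that experiment (not the course's reference implementation): a bagging-style forest of `DecisionTreeClassifier`s with a `replace` flag to toggle sampling with or without replacement. The helper names (`bagged_forest`, `predict`), the synthetic dataset from `make_classification`, and the 80% sampling fraction are all just assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def bagged_forest(X, y, n_trees=100, sample_frac=0.8, replace=True, seed=0):
    """Train n_trees decision trees, each on a random row sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n_rows = len(X)
    n_samples = int(sample_frac * n_rows)
    trees = []
    for _ in range(n_trees):
        idx = rng.choice(n_rows, size=n_samples, replace=replace)
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1 << 31)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def predict(trees, X):
    """Majority vote over the individual trees (binary labels 0/1)."""
    votes = np.stack([t.predict(X) for t in trees])
    return (votes.mean(axis=0) >= 0.5).astype(int)


X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for replace in (True, False):
    forest = bagged_forest(X_tr, y_tr, replace=replace)
    acc = (predict(forest, X_te) == y_te).mean()
    print(f"sampling with replacement={replace}: test accuracy {acc:.3f}")
```

On a toy dataset like this the two variants usually land within noise of each other, which is the kind of comparison worth reporting back.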

Cheers,
Elemento

Please check this out for the rationale behind sampling with replacement, and also for when sampling without replacement is approximately equal to sampling with replacement.
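One quick way to see the "approximately equal" part, assuming it refers to the usual small-sample-relative-to-the-dataset argument: compare the probability that a given row ends up in the sample under each scheme. The helper below is just an illustration with made-up sizes.

```python
def p_row_in_sample(n_rows, n_draws):
    """P(a given row appears at least once) under each sampling scheme."""
    with_repl = 1 - (1 - 1 / n_rows) ** n_draws   # with replacement
    without_repl = n_draws / n_rows               # without replacement
    return with_repl, without_repl

for n_rows, n_draws in [(10_000, 100), (10_000, 1_000), (10_000, 10_000)]:
    w, wo = p_row_in_sample(n_rows, n_draws)
    print(f"n={n_rows}, m={n_draws:>6}: with repl. {w:.4f}, without repl. {wo:.4f}")
```

When the number of draws is a small fraction of the dataset, the two probabilities nearly coincide; they only diverge as the sample size approaches the dataset size.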

P.S. we usually assume independent and identically distributed data when formulating the optimization algorithm.