RF Sampling with replacement duplicate rows

If we sample with replacement one row at a time, we allow duplicate rows in the training set of each tree. Doesn't that bias the results?
I would expect (also based on the Random Forest implementation in sklearn) that each tree would get a subset of the whole training set without duplicate rows.
Anyone care to clarify?

Hey @gkouro,
Having duplicate data can certainly bias the results, but several factors work against this, so in practice the effect is usually not significant.

  • For starters, as the training dataset grows and the number of samples drawn for each decision tree shrinks relative to it, drawing the same example more than once becomes less common, even when sampling with replacement.
  • Similarly, since a random forest aggregates many decision trees, even if some individual trees are slightly biased toward a small number of examples, combining the trees' outputs into the final prediction tends to wash that bias out.
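To put a number on the first point: in a bootstrap sample of size n drawn from n rows, the probability that any given row appears at least once is 1 - (1 - 1/n)^n, which approaches 1 - 1/e ≈ 63.2% as n grows. A quick simulation (a sketch; the size n = 10,000 is an arbitrary choice) confirms this:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # hypothetical training-set size

# Draw one bootstrap sample (with replacement) and count distinct rows.
sample = rng.integers(0, n, size=n)
unique_frac = np.unique(sample).size / n

# Theory: P(a given row appears at least once) = 1 - (1 - 1/n)^n -> 1 - 1/e
print(f"unique fraction: {unique_frac:.3f}")
```

So roughly a third of each tree's training set consists of repeated rows, and conversely about a third of the data is left out of each tree (the "out-of-bag" rows), which is exactly what the averaging across many trees smooths over.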

However, if you still feel otherwise, implementing a random forest yourself is not that difficult if you use the decision tree implementation from scikit-learn. So try implementing it both ways and see whether you find any difference in the results. Do share your results with the community.
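A minimal sketch of that experiment, assuming scikit-learn is available. The helper names (`fit_forest`, `forest_predict`) and the toy dataset are illustrative, not part of any library; when sampling without replacement, each tree is given a 63.2% subsample so that both variants see roughly the same number of distinct rows per tree:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

def fit_forest(X, y, n_trees=25, replace=True):
    """Fit one decision tree per bootstrap (or subsample) of the data."""
    trees, n = [], len(X)
    for i in range(n_trees):
        size = n if replace else int(0.632 * n)  # match expected distinct rows
        idx = rng.choice(n, size=size, replace=replace)
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def forest_predict(trees, X):
    """Majority vote across the ensemble."""
    votes = np.stack([t.predict(X) for t in trees])
    return (votes.mean(axis=0) >= 0.5).astype(int)

acc_with = (forest_predict(fit_forest(X, y, replace=True), X) == y).mean()
acc_without = (forest_predict(fit_forest(X, y, replace=False), X) == y).mean()
print(f"with replacement: {acc_with:.3f}, without: {acc_without:.3f}")
```

To make the comparison fair, evaluate both variants on a held-out test split rather than the training data; on most datasets the two ensembles end up very close, which is the point of the answer above.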


Please check this out for the rationale behind sampling with replacement, and for when sampling without replacement is approximately equivalent to sampling with replacement.

P.S. We usually assume the data are independent and identically distributed (i.i.d.) when formulating the optimization algorithm.