I have two important questions about this lab in Module 4, “C2_W4_Lab_02_Tree_Ensemble”:
Selected in the red box, we have a parameter n_estimators. From the description, it sounds like the value of B (from b = 1 to B) in the lecture slides. Just confirming: is n_estimators the value that decides how many times the sampling-with-replacement procedure will be performed?
Selected in the blue box, it says that the Random Forest algorithm chooses a subset of the training examples to train each individual tree. From the description, it sounds like a subset of the “m” training examples is used to train each individual tree “b”.
Just to confirm: in the lecture videos we learned that the Random Forest algorithm uses a subset of the “n” features to train each individual tree. However, the lab mentions something we didn’t learn in the videos: along with choosing a subset of the features “n”, the algorithm also chooses a subset of the training examples “m” to train each individual tree. Am I following correctly?
Yes. As long as sampling of the data is enabled, one round of sampling happens before each tree is built. If B trees are built, the data is sampled B times.
m is the total number of samples, right? If so, then we first need to ask ourselves how many samples we are going to draw when building each tree. If we set it to m/2, then m/2 samples are drawn from the full dataset to train the first tree (b=1), then another m/2 samples are drawn to train the second tree (b=2), and so on and so forth.
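As a minimal sketch of that loop in plain numpy (not the lab’s actual code; B = 100 is just an assumed value):

```python
import numpy as np

rng = np.random.default_rng(0)
m, B = 734, 100  # m training examples (from this lab); B trees is an assumed value

for b in range(B):
    # one round of sampling with replacement before building tree b
    idx = rng.choice(m, size=m // 2, replace=True)
    # X_train[idx], y_train[idx] would then be used to fit tree b
```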
In building a tree, we can choose to do none, one, or both of the following:
sample a subset of data
sample a subset of features
If we choose to do both, then when building the first tree (b=1), a subset of the data is sampled, and then a subset of the features is sampled. If the full dataset has m samples and n features, then after sampling, the dataset used for that tree may have only m/2 samples and n/2 features. The actual number of samples and features depends on your settings, as in the sketch below.
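For example, in scikit-learn’s RandomForestClassifier both kinds of sampling map to constructor arguments. A hedged sketch, with the specific fractions being purely illustrative (note that in scikit-learn, max_features applies per split rather than per tree):

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,  # B: number of trees, one sampling round each
    bootstrap=True,    # enable sampling of the data (with replacement)
    max_samples=0.5,   # draw m/2 of the training examples per tree
    max_features=0.5,  # consider n/2 of the features at each split
    random_state=0,
)
# model.fit(X_train, y_train) would build all B trees
```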
Sampling is redone before building every tree.
There are actually even more ways of sampling features if you dive into xgboost, and they are not covered by the lecture. Note that the lecture does not teach 100% of xgboost; it only has time to give us some core understanding.
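If you are curious, here is a hedged sketch of those xgboost knobs (the parameter values are illustrative only, not recommendations):

```python
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=100,       # B: number of trees / boosting rounds
    subsample=0.5,          # fraction of training examples sampled per tree
    colsample_bytree=0.8,   # fraction of features sampled once per tree
    colsample_bylevel=0.8,  # resampled again at each depth level
    colsample_bynode=0.8,   # resampled again at each split
)
# the colsample_* fractions apply cumulatively
```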
Thanks for explaining that; my first question is answered. I just want to make one clarification: when I say “m”, I mean the total number of training examples.
For instance, in this lab:
“m” for training data is 734
“m” for validation data is 184
Just like in neural networks, where we used “m” to denote the total number of training examples.
Perfect… It makes sense now. Thanks for answering that in detail.
Just one last thing. The subset of samples taken from the “m” examples is a random subset, right, just like the random subset taken from the “n” features?
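In case it helps to see the two kinds of randomness side by side, a minimal numpy sketch (m = 734 comes from this lab; n = 20 and k ≈ √n for features are assumptions following the lecture’s suggestion):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 734, 20  # m from this lab's training set; n features is an assumed value

# examples: random subset drawn with replacement
row_idx = rng.choice(m, size=m // 2, replace=True)
# features: random subset drawn without replacement (lecture suggests k = sqrt(n))
col_idx = rng.choice(n, size=int(np.sqrt(n)), replace=False)
```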