# Boosted Trees Iteration B or B-1?

Andrew: "And the boosting procedure will do this for a total of B times where on each iteration, you look at what the ensemble of trees for trees 1, 2 up through (b - 1), are not yet doing that well on. And when you're building tree number b, you will then have a higher probability of picking examples that the ensemble of the previously sampled trees is still not yet doing well on."

When we are on the second-to-last iteration of the bagged algorithm, we are sampling for the (B - 1)-th time: we create a tree based on the examples that we get, and then go back to our original dataset and apply the boosting procedure for the (B - 1)-th time.

Keeping that in mind, when we are on the last iteration of the bagged algorithm, we are sampling for the B-th time, which implies that we have created our final tree and no more boosting is needed, as there is no other tree that will be created.

Based on that, shouldn't the boosting procedure be done a total of B - 1 times, rather than B times as Andrew says in the course lecture?

Module 4, Video: XGBoost

Hi @Ammar_Jawed, I don’t quite follow what you have said above.

Therefore, let me just rephrase it, and I hope you can follow my version:

On building tree b = 2: examine all samples’ predictions by an ensemble of 1 tree, select m samples from all samples but with a higher probability of picking those that were predicted poorly, then build the 2nd tree with the selected m samples.

On building tree b = B - 2: examine all samples’ predictions by an ensemble of B - 3 trees, select m samples from all samples but with a higher probability of picking those that were predicted poorly, then build the (B - 2)-th tree with the selected m samples.

On building tree b = B - 1: examine all samples’ predictions by an ensemble of B - 2 trees, select m samples from all samples but with a higher probability of picking those that were predicted poorly, then build the (B - 1)-th tree with the selected m samples.

On building tree b = B: examine all samples’ predictions by an ensemble of B - 1 trees, select m samples from all samples but with a higher probability of picking those that were predicted poorly, then build the B-th tree with the selected m samples.

Therefore, the boosting procedure happens B times for building B trees.
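The counting above can be sketched as a short Python loop. This is only an illustration of how the rounds are counted, not XGBoost’s actual code; `fit_stump` and `ensemble_error` are made-up stand-ins for a real tree learner and a real per-sample error measure.

```python
import random

def fit_stump(xs, ys):
    # Toy "tree": always predict the mean label of its training samples
    # (a stand-in, not a real decision tree).
    mean = sum(ys) / len(ys)
    return lambda x: mean

def ensemble_error(trees, x, y):
    # Absolute error of the current ensemble on one sample,
    # plus a small floor so sampling weights stay positive.
    pred = sum(t(x) for t in trees) / len(trees)
    return abs(y - pred) + 1e-6

def build_boosted_trees(X, y, B, m, seed=0):
    rng = random.Random(seed)
    n = len(X)
    trees = []
    for b in range(1, B + 1):          # b = 1, 2, ..., B: the body runs B times
        if not trees:
            weights = [1.0] * n        # round 1: no previous trees, sample uniformly
        else:
            weights = [ensemble_error(trees, X[i], y[i]) for i in range(n)]
        # Poorly predicted samples are more likely to be picked.
        idx = rng.choices(range(n), weights=weights, k=m)
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return trees
```

Every pass through the loop, including the first one, is one boosting round, so the returned ensemble always contains B trees.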

Cheers,
Raymond

I think there is still some confusion here. Let me try to explain it briefly.

My question is this: the boosting procedure starts at tree b = 2, where it examines all samples’ predictions by the ensemble of trees b < 2, and ends at tree b = B, where it examines all samples’ predictions by the ensemble of trees b < B.

Then, according to this, the total number of steps taken from b = 2 to b = B should be B - 1, right? And based on that, the boosting procedure should do this a total of B - 1 times, whereas in the course Andrew said: “And the boosting procedure will do this for a total of B times…”

Hello @Ammar_Jawed,

No problem. We have made major progress because now the focus is on the first tree.

I think the building of the first tree is also considered a boosting round, and with that, we have one round plus the remaining B - 1 rounds, for a total of B.

What distinguishes the first tree from the rest of the B - 1 trees is that there is no previous tree before the first one is built, so we cannot make a prediction based on previous trees, and consequently the sampling cannot be based on such a prediction. This is what makes the first tree different and our focus.

How the first tree is handled depends on the XGBoost implementation; since I have not looked into that, I can’t be certain. A naive way would be for the sampling not to rely on any prediction and, instead, to resort to purely random sampling.

Another point I want to make here is that the sampling approach is not what defines a tree as a boosting round. Even if all of the trees used a purely random sampling method, or if all of the trees just used all of the samples without any subsampling, all of the trees would still be boosting rounds, because, for example, each and every tree learns to fit the residual from the previous rounds. It is only that in the first round, the residual of a sample can, for example, just be the sample’s label.
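A minimal illustration of that last point, assuming squared loss and with each “tree” simplified down to a single constant per round (my simplification for brevity, not XGBoost’s actual learner):

```python
def residual_fitting_rounds(y, B):
    # The ensemble's prediction starts at 0 for every sample.
    pred = [0.0] * len(y)
    for b in range(1, B + 1):
        # Residual with respect to the previous rounds. In round 1, pred is
        # all zeros, so each sample's residual is just its label.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        step = sum(residuals) / len(residuals)   # toy "tree": fit the mean residual
        pred = [pi + step for pi in pred]
    return pred
```

Round 1 goes through exactly the same steps as rounds 2 through B, which is why it is counted as a boosting round like any other.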

We don’t exclude the first tree when counting the number of boosting rounds just because it does not have a previous tree. All of the trees, including the first one, are boosting rounds, and I think that is because all of them are constructed under the framework of gradient boosting.

Cheers,
Raymond