In the week 4 lab, we loop over the list of features and pick the one with the highest information gain. But we never remove a feature from the consideration set after it's chosen. This did not end up being a problem in the lab, but couldn't build_tree_recursive() choose the same feature for multiple layers of depth? For example, suppose there are 3 features where one is spot-on and the other two are completely worthless. We force build_tree_recursive() to build a decision tree with max_depth=2; we don't give it the option to short-circuit. Therefore, isn't it possible some data set could result in a tree where the root node chooses the same feature as the two leaf nodes? I'm probably not thinking this through very carefully, but it seems suspicious that build_tree_recursive() never removes any features from the consideration set. How does it avoid picking the same feature every time? Is it just because that's the way the math works out? Thanks in advance for helping me understand better.
Hi Micheal, thanks for the question!
We are letting get_best_split decide for us which feature to use at each split, so yes, it's possible for every splitting decision to favour one and only one feature.
In the first place, build_tree_recursive calls get_best_split, which judges which feature brings us the maximum information gain. In this sense, yes again, we can say that's the way the math works out, because it's all about information gain.
However, there is also a countering force that keeps a feature from being used repeatedly along a branch of the tree. Consider a binary feature (= 0 or 1): once we split on it at one node, each of its two child nodes will carry all 0s or all 1s for that feature, so splitting on the same feature further down that branch will not bring any information gain (you may examine compute_information_gain to see the logic behind this). If you agree with this, I guess you would also agree that the same logic applies to other categorical features with more than two distinct values. Of course, this countering effect is more visible for categorical features with fewer distinct values, or when the tree depth is already quite large.
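To make that concrete, here is a minimal sketch of the idea (not the lab's actual compute_information_gain, just the same entropy-based calculation on a made-up toy data set): once a binary feature has been used on a branch, it is constant inside each child node, so re-splitting on it yields exactly zero gain, and another feature can win even if it looked worthless at the root.

```python
import numpy as np

def entropy(y):
    """Entropy of a binary (0/1) label array."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    if p == 0 or p == 1:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def info_gain(X, y, feature):
    """Information gain from splitting on a binary (0/1) feature."""
    left, right = y[X[:, feature] == 1], y[X[:, feature] == 0]
    w_left, w_right = len(left) / len(y), len(right) / len(y)
    return entropy(y) - (w_left * entropy(left) + w_right * entropy(right))

# Toy data: feature 0 is fairly informative, feature 1 looks useless at the root.
X = np.array([[1, 0], [1, 1], [1, 0], [1, 1],
              [0, 0], [0, 1], [0, 0], [0, 1]])
y = np.array([1, 1, 1, 0, 0, 0, 0, 1])

print(info_gain(X, y, 0))  # ~0.19 -> feature 0 wins at the root
print(info_gain(X, y, 1))  #  0.00 -> feature 1 loses at the root

# Inside the left child (rows where feature 0 == 1), feature 0 is constant,
# so re-splitting on it gives exactly zero gain, while feature 1 now helps.
child = X[:, 0] == 1
print(info_gain(X[child], y[child], 0))  #  0.00
print(info_gain(X[child], y[child], 1))  # ~0.31
```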
In the cases above, the same feature is prevented from being reused either because it yields no information gain or because it is no longer the best in information gain. In more sophisticated tree implementations like xgboost, there are additional parameters that add control to the split decision, and there are also parameters that make the splitting algorithm consider a different, randomly sampled subset of features at each split.
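For example, a sketch using the xgboost Python package (the parameter values here are arbitrary, just for illustration): gamma demands a minimum loss reduction before a split is accepted, while colsample_bynode makes each split consider only a random subset of the features.

```python
from xgboost import XGBClassifier

# Sketch only: the values below are arbitrary, not a recommendation.
model = XGBClassifier(
    n_estimators=100,
    max_depth=2,
    gamma=1.0,             # minimum loss reduction required to make a split
    min_child_weight=5,    # minimum sum of instance weight needed in a child
    colsample_bynode=0.5,  # randomly sample 50% of features at each split
)
# model.fit(X_train, y_train)  # X_train / y_train stand in for your own data
```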
So, in summary: yes, if the maths (information gain) decides a certain feature is always the best, then only that feature will be used; however, there is also a countering force from the maths itself, and on top of that, we can artificially avoid this by setting parameters in more sophisticated tree packages. Having said that, I don't mean it's always bad to have one feature dominating the splitting, if indeed that feature is just that important.
Cheers!
Raymond