Splitting on a Continuous Variable for Decision Trees is Inefficient

In the third week Andrew says that the best way to split on a continuous variable for decision trees is to take the 9 midpoints between the 10 examples as candidate splits and pick the one that gives the highest information gain. However, since the gap between adjacent examples can be large, why not use a regression to find a better split point? The proposed method may maximize information gain on the training set, but most likely not on the test set.

Hello @_juanes_rios,

Let’s think this through and discuss it. Take a classification tree, for example: after fitting a logistic regression on one of the features, we would still need to figure out which threshold to use, wouldn’t we? Would finding that threshold be as difficult as finding a split point in the way described in your question?

Or are you thinking of training a general regression model up front that predicts the split point from the values of one feature, for any amount of data, and the corresponding labels? Have you run any experiment to measure how well this idea performs?


Hi, I am referring to a node where you split on weight. The split should then take the form of the question: is weight below X?

What I am saying is that the video suggests finding X by iterating over N-1 candidate values and choosing the one that works best for the split. My question really is: why isn’t it better to find X via a regression model?
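For reference, here is a minimal sketch of the midpoint search being discussed, assuming binary labels and entropy-based information gain. The function names and the ten weight/label values are hypothetical and only serve as an illustration; they are not taken from the course material.

```python
import math

def entropy(labels):
    """Binary entropy of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def best_midpoint_split(values, labels):
    """Try every midpoint between adjacent sorted values and return
    (threshold, information_gain) for the best candidate."""
    pairs = sorted(zip(values, labels))
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    n = len(xs)
    parent_entropy = entropy(ys)
    best_threshold, best_gain = None, -1.0
    for i in range(n - 1):
        if xs[i] == xs[i + 1]:
            continue  # identical values give no usable midpoint
        threshold = (xs[i] + xs[i + 1]) / 2
        left, right = ys[: i + 1], ys[i + 1 :]
        weighted_child_entropy = (
            len(left) / n * entropy(left) + len(right) / n * entropy(right)
        )
        gain = parent_entropy - weighted_child_entropy
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain

# Hypothetical data: 10 animals, weight as the feature, 1 = cat, 0 = not cat.
weights = [7.2, 8.4, 8.8, 9.2, 10.2, 11.0, 15.0, 17.0, 18.0, 20.0]
is_cat = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

threshold, gain = best_midpoint_split(weights, is_cat)
print(threshold, gain)
```

With 10 examples the loop evaluates at most 9 candidates, and each candidate is scored against the labels, which is what lets the chosen threshold separate the classes.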

I understand your idea. What I was trying to do is go one step further and think about what doing it via a regression model would look like. Have you thought about it? How exactly would we do it? How would you define the labels for that regression model? Once we have a definition, how do we get the labels in the first place? Is this model applicable to any split? By “any” I mean the cat/not cat example as well as every other node split.

I am suggesting that we discuss it.

What I would do is find the weight ‘w’ by minimizing the squared distances between w and all the weight values we have. This way, w would not necessarily coincide with a data point we already have. I would use this approach for any continuous variable in a decision tree. What do you think?
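To make this proposal concrete: the w that minimizes the sum of squared distances (w - x_i)^2 over all weights x_i is the mean of the weights, since setting the derivative 2 * sum(w - x_i) to zero gives w = sum(x_i) / N. The sketch below, on hypothetical data, compares that w with a threshold placed between the two classes. The helper names and the values are assumptions made for illustration only.

```python
def least_squares_point(values):
    """The w minimizing sum((w - x)**2 for x in values), which is the mean."""
    return sum(values) / len(values)

def majority_vote_errors(values, labels, threshold):
    """Number of examples misclassified when each side of the split
    predicts its own majority class."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]

    def errors(side):
        ones = sum(side)
        return min(ones, len(side) - ones)

    return errors(left) + errors(right)

# Hypothetical data: 10 animals, weight as the feature, 1 = cat, 0 = not cat.
weights = [7.2, 8.4, 8.8, 9.2, 10.2, 11.0, 15.0, 17.0, 18.0, 20.0]
is_cat = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

w = least_squares_point(weights)  # computed without looking at is_cat
errors_at_mean = majority_vote_errors(weights, is_cat, w)
errors_at_midpoint = majority_vote_errors(weights, is_cat, 10.6)
print(w, errors_at_mean, errors_at_midpoint)
```

On this data the mean lands at about 12.48 and leaves one example on the wrong side, while the midpoint 10.6 between the heaviest cat and the lightest non-cat separates the classes cleanly. The labels never enter the least-squares computation, so the resulting w would be the same for any labeling of these weights.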

I am not sure this would address your concern about the efficiency of the decision tree method.