Bisection search for the continuous-valued features case?

At 4:09 in the video “Decision tree learning - Continuous valued features”, Prof. Ng suggests: “In the more general case, we’ll actually try not just three values, but multiple values along the X axis. And one convention would be to sort all of the examples according to the weight, or according to the value of this feature, and take all the values that are midpoints between the sorted list of training examples as the values for consideration for this threshold over here. This way, if you have 10 training examples, you will test nine different possible values for this threshold and then try to pick the one that gives you the highest information gain.”
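For concreteness, here is a minimal sketch of the procedure the lecture describes, assuming a binary 0/1 label and entropy-based information gain (the data and function names are my own, just for illustration):

```python
import numpy as np

def entropy(y):
    """Binary entropy (base 2) of a 0/1 label array."""
    p = np.mean(y)
    if p == 0 or p == 1:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def best_midpoint_split(x, y):
    """Exhaustively test every midpoint between consecutive sorted
    feature values; return the threshold with the highest gain."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    midpoints = (x[:-1] + x[1:]) / 2   # n examples -> n-1 candidates
    base = entropy(y)
    best_gain, best_t = -1.0, None
    for t in midpoints:
        left, right = y[x <= t], y[x > t]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if base - weighted > best_gain:
            best_gain, best_t = base - weighted, t
    return best_t, best_gain

# Made-up data: 10 animal weights and cat (1) / not-cat (0) labels.
x = np.array([7.2, 8.4, 8.8, 9.2, 10.2, 10.8, 11.0, 13.0, 14.0, 18.0])
y = np.array([0, 1, 1, 0, 1, 1, 1, 0, 0, 0])
print(best_midpoint_split(x, y))   # tests exactly 9 thresholds
```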

Why do we need to test 9 different possible values for 10 training examples?! What if we have 10 million training examples? How many values would we then need to test?

Why not apply a bisection search here?

Hello @Robert_Ascan,

Because there are 9 possible cuts and we don't know in advance which one is best, so to determine the best cut we have to try them all. Bisection (or ternary) search only helps when the objective is monotonic (or unimodal) in the threshold, but information gain as a function of the threshold can have multiple local maxima, so skipping candidates can miss the best cut. Trying every midpoint does not scale to large datasets, which is why popular decision-tree packages like xgboost use a histogram method to reduce the number of cuts that need to be tested.
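Here is a rough sketch of the candidate-reduction idea behind the histogram method (illustrative only; this is not how xgboost actually implements it internally, and the function name is my own):

```python
import numpy as np

def histogram_candidates(x, num_bins=256):
    """Bucket the feature into (at most) num_bins quantile bins and use
    the bin boundaries as candidate thresholds. The number of cuts to
    test drops from n - 1 to at most num_bins - 1, independent of n."""
    qs = np.linspace(0, 1, num_bins + 1)[1:-1]   # interior quantiles
    return np.unique(np.quantile(x, qs))

x = np.random.default_rng(0).normal(size=10_000_000)
print(len(histogram_candidates(x)))   # ~255 cuts instead of ~10 million
```

With 10 million training examples this tests roughly 255 thresholds instead of roughly 10 million, which is the scalability point above.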

Raymond

P.S. We can discuss bisection search further if you can show that it can find the best cut with fewer than 9 tests in the case of 10 training examples.
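To illustrate the point about multiple local maxima, here is a toy example (made-up data) where the gain curve over thresholds is not unimodal, so bisection/ternary search can land on the wrong cut:

```python
import numpy as np

def entropy(y):
    p = np.mean(y)
    return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# Toy data: the positive class sits in the middle of the feature range.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 1, 1, 0, 0])

for t in (x[:-1] + x[1:]) / 2:
    left, right = y[x <= t], y[x > t]
    gain = entropy(y) - (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
    print(f"threshold {t:.1f}: gain {gain:.3f}")
# Prints gains 0.109, 0.252, 0.000, 0.252, 0.109 -- two equal local
# maxima with a valley between them, so a search that assumes the gain
# curve is unimodal can converge to the wrong cut.
```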