At 4:09 in the video “Decision tree learning - Continuous valued features”, Prof. Ng says: “In the more general case, we’ll actually try not just three values, but multiple values along the X axis. And one convention would be to sort all of the examples according to the weight or according to the value of this feature and take all the values that are mid points between the sorted list of training examples as the values for consideration for this threshold over here. This way, if you have 10 training examples, you will test nine different possible values for this threshold and then try to pick the one that gives you the highest information gain.”
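For context, here is a minimal sketch of the procedure as I understand it from the quote: sort the feature values, form the n − 1 midpoints between consecutive sorted values, and pick the midpoint with the highest information gain. The entropy-based gain, the function names, and the toy weight/label data are my own assumptions, not from the lecture:

```python
import numpy as np

def entropy(y):
    """Entropy of a binary label array (0 for an empty or pure split)."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    if p == 0 or p == 1:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def best_threshold(x, y):
    """Try every midpoint between consecutive sorted feature values
    and return the threshold with the highest information gain."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    # n examples -> n - 1 candidate midpoints, as in the lecture
    midpoints = (x_sorted[:-1] + x_sorted[1:]) / 2
    parent_entropy = entropy(y_sorted)
    best_gain, best_t = -1.0, None
    for t in midpoints:
        left = y_sorted[x_sorted <= t]
        right = y_sorted[x_sorted > t]
        w_left = len(left) / len(y_sorted)
        gain = parent_entropy - (w_left * entropy(left)
                                 + (1 - w_left) * entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# Hypothetical data: 10 weights with binary labels -> 9 candidate thresholds
weights = np.array([7.2, 8.8, 9.2, 10.2, 10.8, 11.0, 13.0, 14.0, 15.0, 18.0])
labels = np.array([1, 1, 0, 1, 1, 1, 0, 0, 0, 0])
print(best_threshold(weights, labels))
```

With 10 examples this loop evaluates exactly 9 candidate thresholds, which is what prompts my question below.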
Why do we need to test 9 different candidate values for just 10 training examples? What if we have 10 million training examples? How many values would we then need to test?
Why not apply a bisection search here instead of evaluating every midpoint?