Can we use gradient descent to find the value of the split threshold that gives the highest information gain?

Hello,

In the practice quiz, Question 3 asks:

For a continuous valued feature (such as weight of the animal), there are 10 animals in the dataset. According to the lecture, what is the recommended way to find the best split for that feature?

The following option is marked as a wrong answer:

Use gradient descent to find the value of the split threshold that gives the highest information gain.

I’m wondering why this is wrong. It seems to me that the entropy after splitting on a continuous feature can be treated as a cost function of the split threshold. Can’t we use gradient descent to find the threshold that minimizes that entropy?

It may be possible, but the quiz asks you to consider what was presented in the lecture.
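For anyone who wants to experiment, here is a minimal Python sketch. It assumes the lecture's method is the usual one: sort the feature values, try the midpoint between each pair of consecutive values as a threshold, and keep the one with the highest information gain. The data and the function names (`entropy`, `information_gain`, `candidate_thresholds`, `best_threshold`) are made up for illustration and are not from the course. One thing the sketch makes visible is that the information gain only changes when the threshold crosses a data point, so between data points it is flat and its gradient is zero, which gives plain gradient descent nothing to follow.

```python
import math

def entropy(labels):
    """Entropy in bits of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting at `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - weighted

def candidate_thresholds(values):
    """Midpoints between consecutive distinct sorted values."""
    xs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(xs, xs[1:])]

def best_threshold(values, labels):
    """Candidate threshold with the highest information gain."""
    return max(candidate_thresholds(values),
               key=lambda t: information_gain(values, labels, t))

# Hypothetical data: 10 animals, weight feature, 1 = cat, 0 = not cat.
weights = [7.2, 8.4, 8.8, 9.2, 10.2, 11.0, 15.0, 16.0, 18.0, 20.0]
is_cat = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

print(len(candidate_thresholds(weights)), "candidate thresholds")
print("best threshold:", best_threshold(weights, is_cat))

# Moving the threshold between two neighbouring data points leaves
# the split, and therefore the information gain, unchanged.
print(information_gain(weights, is_cat, 10.3) ==
      information_gain(weights, is_cat, 10.9))
```

With 10 distinct values there are only 9 candidates to check, so an exhaustive search is cheap and needs no learning rate or initial guess.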
