Confusion about the concepts of entropy and information gain in Decision Tree!

Course-2/Week-4/Decision Tree

Unable to understand why -p log(p) - (1-p) log(1-p) is being taken as entropy. What really is entropy, btw?

Thanks in advance!!

Entropy, in the context of machine learning and decision trees, is a measure of uncertainty or disorder in a dataset. The concept comes from information theory, where entropy quantifies the amount of unpredictability or randomness in a system. The formula H(p) = -p\log(p) - (1-p)\log(1-p) is the entropy for a binary classification problem, where p is the probability of one class (say, the positive class) and 1-p is the probability of the other class (the negative class). The logarithm here is base 2 (the binary logarithm), which is Shannon's convention and the usual choice in this context.
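
If it helps to see the formula concretely, here is a minimal sketch of it in Python (the function name binary_entropy and the use of NumPy are my own choices for illustration, not something from the course):

```python
import numpy as np

def binary_entropy(p):
    """Entropy H(p) = -p*log2(p) - (1-p)*log2(1-p) for a binary problem.

    By convention, 0 * log2(0) is treated as 0, so H(0) = H(1) = 0.
    """
    if p == 0 or p == 1:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)
```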

Entropy is high (close to 1) when the classes are evenly distributed: at p = 0.5 there is maximum uncertainty about which class a randomly chosen sample belongs to, and H reaches its maximum value of 1. Entropy is low (close to 0) when one class dominates (e.g., p = 0.99 or p = 0.001), and it reaches its minimum value of 0 at p = 0 and p = 1, where the outcome is completely predictable. The entropy is low because there is little uncertainty; almost all of the data belongs to one class, so there is less "disorder".
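
To see this behaviour numerically, you can evaluate the sketch above at a few values of p (the outputs in the comments are approximate):

```python
print(binary_entropy(0.5))    # 1.0   -> classes evenly split, maximum uncertainty
print(binary_entropy(0.99))   # ~0.08 -> one class dominates, little uncertainty
print(binary_entropy(1.0))    # 0.0   -> outcome completely predictable
```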

When building a decision tree, we want to split the data in a way that maximizes information gain, which is the reduction in entropy after a split: the entropy of the parent node minus the weighted average of the entropies of the child nodes. Information gain helps the decision tree determine which features best separate the classes, reducing uncertainty and improving classification accuracy. In short, entropy measures the amount of uncertainty in the data for a binary classification problem, and information gain is used to find the splits that reduce that uncertainty as much as possible.
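
Here is a small sketch of that computation, assuming binary labels stored in NumPy arrays; the function name information_gain and the toy split are illustrative, not the course's exact code:

```python
import numpy as np

def entropy(y):
    """Entropy of a 1-D array of 0/1 labels."""
    p = np.mean(y)                      # fraction of positive labels
    if p == 0 or p == 1:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def information_gain(y_parent, y_left, y_right):
    """Reduction in entropy from splitting y_parent into y_left and y_right.

    The children's entropies are weighted by the fraction of samples
    that each child receives.
    """
    w_left = len(y_left) / len(y_parent)
    w_right = len(y_right) / len(y_parent)
    return entropy(y_parent) - (w_left * entropy(y_left) + w_right * entropy(y_right))

# Toy example: a split that mostly separates the classes has positive gain.
parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])
left   = np.array([1, 1, 1, 0])             # mostly positive
right  = np.array([1, 0, 0, 0])             # mostly negative
print(information_gain(parent, left, right))  # ~0.19
```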
