Why are we calling it the entropy of p_1? "Entropy of p_i", or p^{node}, or P might be better, because the current notation implies that the entropy function's output depends on p_1 specifically, which is not the case. H(p_1), H(p_2), H(p_3), etc. would all give the same entropy value for a given node, so the notation is confusing, especially when trying to generalize to a multi-class decision tree.
If the nodes are denoted [C_1, C_2, C_3, ..., C_J] and there are k classes in the dataset, then the entropy for a given node C_j should be

H(C_j) = -p_0\log_2(p_0) - p_1\log_2(p_1) - ... - p_{k-1}\log_2(p_{k-1})
But I understand that this notation is meant to simplify the calculations, as in the below image. My notation might conflict with some other notation, and this was just an idea, but I still think the notation in the videos needs some changes, or an additional slide that generalizes the entropy formula better.
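To make the idea concrete, here is a minimal Python sketch of that per-node entropy; note that node_entropy, its arguments, and the optional base parameter are names I made up for illustration, not anything from the course:

```python
import math

def node_entropy(class_counts, base=2):
    """Entropy of a node, H(C_j) = -sum_i p_i * log_base(p_i), from per-class counts."""
    total = sum(class_counts)
    props = [c / total for c in class_counts if c > 0]   # skip empty classes: 0*log(0) -> 0
    return -sum(p * math.log(p, base) for p in props)

print(node_entropy([3, 4, 2]))   # e.g. 3 cats, 4 dogs, 2 mice -> about 1.53
```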
I have searched for similar topics and there seem to be a few unanswered posts, which suggests that the explanation of this topic is not perfectly intuitive unless we understand it fully.
Is my generalization wrong? For the example I created below, I am getting an entropy greater than 1, which is wrong according to the entropy graph slide in the above post. Or is that graph strictly for binary classification?
For example, let's consider the dataset [cat1, cat2, cat3, dog1, dog2, dog3, dog4, mouse1, mouse2].
The entropy for the C_{parent} node is
H(C_{parent}) = -p_{cat}*log_2(p_{cat}) -p_{dog}*log_2(p_{dog}) -p_{mouse}*log_2(p_{mouse})
here p_{cat} = 3/9, p_{dog} = 4/9, and p_{mouse} = 2/9
H(C_{parent}) = -(3/9)*log_2(3/9) - (4/9)*log_2(4/9) - (2/9)*log_2(2/9)
This results in an entropy of 1.53 according to WolframAlpha.
Edit: I think I figured it out. The log base should be the same as the number of classes, which in this case is 3: H(C_{parent}) = -(3/9)*log_3(3/9) - (4/9)*log_3(4/9) - (2/9)*log_3(2/9) = 0.966. And this will be 1 if all of the classes have an equal number of elements, i.e., 3 cats, 3 dogs, 3 mice.
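For what it's worth, a quick Python check (the variable names are just mine) reproduces both numbers:

```python
import math

counts = {"cat": 3, "dog": 4, "mouse": 2}    # the cat/dog/mouse dataset above
total = sum(counts.values())

h_base2 = -sum(c / total * math.log2(c / total) for c in counts.values())
h_base3 = -sum(c / total * math.log(c / total, 3) for c in counts.values())

print(round(h_base2, 3))   # 1.53  (entropy > 1 is expected with 3 classes and base 2)
print(round(h_base3, 3))   # 0.966 (base = number of classes keeps the maximum at 1)
```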
That is a very interesting idea about generalizing the notation, and thank you for your analysis and examples!
Generally, the entropy in that node should be a function of the probability values of all classes; however, the fact that it is a binary problem introduces the relation p_1 = 1 - p_0, which simplifies it to a one-variable function, and thus H(p_1).
If it were a multiclass problem with N classes, then it would not simplify to a one-variable function, because there is only one constraint, \sum p_i = 1, while there are N probabilities. Instead, it would be an (N-1)-variable function H(p_0, p_1, ..., p_{N-2}) (note that p_{N-1} is dropped using the constraint).
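To spell the binary simplification out: using that constraint, the two-class entropy becomes a function of p_1 alone,

H(p_1) = -p_1\log_2(p_1) - (1 - p_1)\log_2(1 - p_1)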
I believe this beginner-level course did not mean to introduce decision trees in a general way, as a specific case is usually easier to understand. I also think that spending one week on an in-depth discussion of the binary case is a good introduction to decision trees. Of course, anyone who is considering generalizing it is going to find that this discussion fills some gaps, and I believe that is a good thing - the scope of the syllabus is clear and defined, and we push the boundaries.
Although a different base results in a different range, any choice of base still works as a relative measurement of information gain, because converting from one base to another only differs by a constant factor.
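Concretely, \log_b(x) = \log_2(x) / \log_2(b), so switching bases rescales every entropy value by the same constant. A quick Python check using the cat/dog/mouse proportions from the example above (just an illustrative snippet, not course code):

```python
import math

p = [3/9, 4/9, 2/9]                          # cat/dog/mouse proportions
h2 = -sum(pi * math.log2(pi) for pi in p)    # base-2 entropy
h3 = -sum(pi * math.log(pi, 3) for pi in p)  # base-3 entropy
print(math.isclose(h3, h2 / math.log2(3)))   # True: same value up to a constant factor
```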
Hmm, since binary problems are very common, having the simplified one-variable function makes sense, but a single slide with a brief mention of the more generalized formula would have been nice.
I don't understand the part about dropping p_{N-1}. Is p_{N-1} substituted by 1 - \sum_{i=0}^{N-2} p_i? But I just noticed that my generalization is wrong; it should be H(C_j) = -p_0\log_k(p_0) - p_1\log_k(p_1) - ... - p_{k-1}\log_k(p_{k-1}).
True, the algorithm will still work; the entropy and information gain might not be <= 1, but relatively they will still tell the same story.
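As a small illustration of that (the helper functions and both candidate splits below are made up for this sketch, not from the course), the ranking of splits by information gain comes out the same in either base:

```python
import math

def entropy(counts, base=2):
    total = sum(counts)
    return -sum(c / total * math.log(c / total, base) for c in counts if c > 0)

def info_gain(parent, children, base=2):
    n = sum(parent)
    weighted = sum(sum(ch) / n * entropy(ch, base) for ch in children)
    return entropy(parent, base) - weighted

parent = [3, 4, 2]                     # 3 cats, 4 dogs, 2 mice
split_a = [[3, 0, 0], [0, 4, 2]]       # hypothetical split: cats vs. everything else
split_b = [[2, 2, 1], [1, 2, 1]]       # hypothetical weaker split

for base in (2, 3):
    print(base, round(info_gain(parent, split_a, base), 3),
                round(info_gain(parent, split_b, base), 3))
# base 2: 0.918 vs 0.018; base 3: 0.579 vs 0.012 - different ranges, same ranking.
```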