Why are we calling it the entropy of p_1? "Entropy of p_i", or p^{node}, or P might be better, because the current notation implies that the entropy function's output depends on p_1 specifically, which is not the case. H(p_1), H(p_2), H(p_3), etc. would all give the same entropy value for a given node, so the notation is confusing, especially when trying to generalize to a multi-class decision tree.
If the nodes are denoted [C_1, C_2, C_3, ..., C_J] and there are k classes in the dataset, then the entropy for a given node C_j should be

H(C_j) = -p_0\log_2(p_0) - p_1\log_2(p_1) - ... - p_{k-1}\log_2(p_{k-1})
But I understand that this notation is meant to simplify the calculations, as in the below image. My notation might conflict with some other notation, and this was just an idea, but I still think the notation in the videos needs some changes, or an additional slide that generalizes the entropy formula better.
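To make the idea concrete, here is a minimal Python sketch of that per-node entropy; note that node_entropy, its arguments, and the optional base parameter are names I made up for illustration, not anything from the course:

```python
import math

def node_entropy(class_counts, base=2):
    """Entropy of a node, H(C_j) = -sum_i p_i * log_base(p_i), from per-class counts."""
    total = sum(class_counts)
    props = [c / total for c in class_counts if c > 0]   # skip empty classes: 0*log(0) -> 0
    return -sum(p * math.log(p, base) for p in props)

print(node_entropy([3, 4, 2]))   # e.g. 3 cats, 4 dogs, 2 mice -> about 1.53
```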
I have searched for similar topics and there seem to be a few unanswered posts, which suggests that the explanation of this topic is not perfectly intuitive unless we understand it fully.
Is my generalization wrong? For the example I created below, I am getting an entropy greater than 1, which is wrong according to the entropy graph slide in the above post. Or is that graph strictly for binary classification?
For example, let's consider the dataset [cat1, cat2, cat3, dog1, dog2, dog3, dog4, mouse1, mouse2].
The entropy for the C_{parent} node is
H(C_{parent}) = -p_{cat}*log_2(p_{cat}) -p_{dog}*log_2(p_{dog}) -p_{mouse}*log_2(p_{mouse})
here p_{cat} = 3/9, p_{dog} = 4/9, and p_{mouse} = 2/9
H(C_{parent}) = -(3/9)*log_2(3/9) - (4/9)*log_2(4/9) - (2/9)*log_2(2/9)
This results in an entropy of 1.53 according to WolframAlpha.
Edit: I think I figured it out. The log base should be the same as the number of classes, which in this case is 3: H(C_{parent}) = -(3/9)*log_3(3/9) - (4/9)*log_3(4/9) - (2/9)*log_3(2/9) = 0.966. And this will be 1 if all of the classes have an equal number of elements, i.e., 3 cats, 3 dogs, 3 mice.
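For what it's worth, a quick Python check (the variable names are just mine) reproduces both numbers:

```python
import math

counts = {"cat": 3, "dog": 4, "mouse": 2}    # the cat/dog/mouse dataset above
total = sum(counts.values())

h_base2 = -sum(c / total * math.log2(c / total) for c in counts.values())
h_base3 = -sum(c / total * math.log(c / total, 3) for c in counts.values())

print(round(h_base2, 3))   # 1.53  (entropy > 1 is expected with 3 classes and base 2)
print(round(h_base3, 3))   # 0.966 (base = number of classes keeps the maximum at 1)
```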
That is a very interesting idea about generalizing the notation, and thank you for your analysis and examples!
Generally, the entropy in that node should be a function of the probability values of all classes; however, the fact that it is a binary problem introduces the relation p_1 = 1 - p_0, which simplifies it to a one-variable function, and thus H(p_1).
If it were a multiclass problem with N classes, then it would not simplify to a one-variable function, because there is only one constraint, \sum p_i = 1, while there are N probabilities. Instead, it would be an (N-1)-variable function H(p_0, p_1, ..., p_{N-2}) (note that p_{N-1} is dropped using the constraint).
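To spell the binary simplification out: using that constraint, the two-class entropy becomes a function of p_1 alone,

H(p_1) = -p_1\log_2(p_1) - (1 - p_1)\log_2(1 - p_1)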
I believe this beginner-level course did not mean to introduce decision trees in a general way, as a specific case is usually easier to understand. I also think that spending one week on an in-depth discussion of the binary case is a good introduction to decision trees. Of course, anyone who is considering generalizing it is going to find that this discussion fills some gaps, and I believe that is a good thing - the scope of the syllabus is clear and defined, and we push the boundaries.
Although a different base results in a different range, any choice of base still works as a relative measurement of information gain, because converting from one base to another only differs by a constant factor.
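Concretely, \log_b(x) = \log_2(x) / \log_2(b), so switching bases rescales every entropy value by the same constant. A quick Python check using the cat/dog/mouse proportions from the example above (just an illustrative snippet, not course code):

```python
import math

p = [3/9, 4/9, 2/9]                          # cat/dog/mouse proportions
h2 = -sum(pi * math.log2(pi) for pi in p)    # base-2 entropy
h3 = -sum(pi * math.log(pi, 3) for pi in p)  # base-3 entropy
print(math.isclose(h3, h2 / math.log2(3)))   # True: same value up to a constant factor
```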
Hmm, since binary problems are very common, having the simplified one-variable function makes sense, but a single slide with a brief mention of the more generalized formula would have been nice.
I don't understand the part about dropping p_{N-1}. Is p_{N-1} substituted by 1 - \sum_{i=0}^{N-2} p_i? But I just noticed that my generalization is wrong; it should be H(C_j) = -p_0\log_k(p_0) - p_1\log_k(p_1) - ... - p_{k-1}\log_k(p_{k-1}).
True, the algorithm will still work; the entropy and information gain might not be <= 1, but relatively they will still tell the same story.
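As a small illustration of that (the helper functions and both candidate splits below are made up for this sketch, not from the course), the ranking of splits by information gain comes out the same in either base:

```python
import math

def entropy(counts, base=2):
    total = sum(counts)
    return -sum(c / total * math.log(c / total, base) for c in counts if c > 0)

def info_gain(parent, children, base=2):
    n = sum(parent)
    weighted = sum(sum(ch) / n * entropy(ch, base) for ch in children)
    return entropy(parent, base) - weighted

parent = [3, 4, 2]                     # 3 cats, 4 dogs, 2 mice
split_a = [[3, 0, 0], [0, 4, 2]]       # hypothetical split: cats vs. everything else
split_b = [[2, 2, 1], [1, 2, 1]]       # hypothetical weaker split

for base in (2, 3):
    print(base, round(info_gain(parent, split_a, base), 3),
                round(info_gain(parent, split_b, base), 3))
# base 2: 0.918 vs 0.018; base 3: 0.579 vs 0.012 - different ranges, same ranking.
```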