Why do we need a weighted average in the entropy function?

What Prof. Ng said is not clear to me

If there’s a node with a lot of examples in it with high entropy that seems worse than if there was a node with just a few examples in it with high entropy. Because entropy, as a measure of impurity, is worse if you have a very large and impure dataset compared to just a few examples and a branch of the tree that is very impure.

Actually, I have never understood weighted comparisons as a metric.

Hello @tbhaxor,

This response relies 90% on intuition. It doesn’t fully explain the root cause mathematically, but if you want to go to the maths, “Maximum Likelihood Estimation” (MLE) is the keyword to start from.

Let’s consider this split where 10 out of 15 samples are “T” and the rest “F”.

We know how to calculate the entropy before splitting:
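A sketch of that calculation, assuming natural log (as discussed further down the thread) and using the counts above, with the 15 pulled outside the bracket:

$$
H_{\text{before}} = -\frac{1}{15}\left[\,10\ln\frac{10}{15} + 5\ln\frac{5}{15}\,\right]
$$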

There are 3 points to observe:

  1. I took the 15 outside
  2. It’s already “weightage comparison”, or in my words, “weighted sum of log probability”
  3. It can be read as: 10 samples have a 10/15 chance of being classified as “T”, 5 have a 5/15 chance as “F”


Now, after splitting:
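Written the same way, assuming natural log and the leaf counts used in this thread (left leaf 8 “T” + 1 “F”, right leaf 2 “T” + 4 “F”), the after-split entropy would be:

$$
H_{\text{after}} = -\frac{1}{15}\left[\,8\ln\frac{8}{9} + 1\ln\frac{1}{9} + 2\ln\frac{2}{6} + 4\ln\frac{4}{6}\,\right]
$$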

  1. I still took the 15 outside. This point doesn’t change.
  2. It’s still “weighted sum of log probability”. This point doesn’t change.
  3. It can be read as: 8 samples have 8/9 chance as “T”, 1 sample 1/9 as “F”, another 2 samples 2/6 as “T”, another 4 samples 4/6 as “F”.

If you make some small changes to the after-splitting formula, you get what we have learnt:
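One way to see the small change, assuming natural log and the same leaf counts (9 samples on the left, 6 on the right): multiply and divide the left terms by 9 and the right terms by 6.

$$
-\frac{1}{15}\left[\,8\ln\frac{8}{9} + 1\ln\frac{1}{9} + 2\ln\frac{2}{6} + 4\ln\frac{4}{6}\,\right]
= \frac{9}{15}\left[-\frac{8}{9}\ln\frac{8}{9} - \frac{1}{9}\ln\frac{1}{9}\right]
+ \frac{6}{15}\left[-\frac{2}{6}\ln\frac{2}{6} - \frac{4}{6}\ln\frac{4}{6}\right]
$$

Each bracket on the right is the entropy of one leaf, and 9/15 and 6/15 are the weights.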

As my point number 2 said, it has always been some weighted sum.

Again, this response is 90% intuition, 10% MLE.
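To check numerically that the two ways of writing the after-split entropy agree, here is a minimal Python sketch. The function names (`pooled_form`, `entropy`, `weighted_form`) are made up for illustration, and natural log is assumed.

```python
import math

def pooled_form(nodes, total, log=math.log):
    """-(1/total) * sum of count * log(count / node_size) over every node."""
    s = 0.0
    for node in nodes:
        n = sum(node)
        for c in node:
            if c > 0:  # skip empty classes, since log(0) is undefined
                s += c * log(c / n)
    return -s / total

def entropy(counts, log=math.log):
    """Ordinary entropy of a single node from its class counts."""
    n = sum(counts)
    return -sum((c / n) * log(c / n) for c in counts if c > 0)

def weighted_form(nodes, total, log=math.log):
    """Sum of (node_size / total) * entropy(node), as taught in the lecture."""
    return sum((sum(node) / total) * entropy(node, log) for node in nodes)

root = [[10, 5]]            # 10 "T" and 5 "F" before the split
leaves = [[8, 1], [2, 4]]   # left leaf and right leaf after the split

before = pooled_form(root, 15)
after_pooled = pooled_form(leaves, 15)
after_weighted = weighted_form(leaves, 15)

print(before, after_pooled, after_weighted)
```

The last two printed numbers should match up to floating-point rounding, and both should be smaller than the first.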


The above reply explains how we calculate the entropy before and after a split. The “weighted sum” is a consequence of that calculation. We have always been calculating weighted sums; they don’t suddenly show up when we split.

Why are you taking the natural log here instead of base 2? I often get stuck deciding which log base to use in a calculation. How do you determine this?

What I think is that a node with a lot of splits below it should be given more importance, because further down the tree it will decide more splits based on the features. A node with fewer splits should be given less importance in learning, because it won’t split much further the way the other node will.

It’s just my habit to use the natural log. Please use base 2 instead to be consistent with the lecture.
The choice is not important: every base should deliver the same decision tree, because changing from one base to another only multiplies the entropy by a constant factor.
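A small Python sketch of that point. `split_a` uses the leaf counts from this thread, while `split_b` is a hypothetical alternative split of the same 15 samples, made up only for comparison; the function names are also illustrative.

```python
import math

def entropy(counts, log):
    """Entropy of one node from its class counts, using the given log function."""
    n = sum(counts)
    return -sum((c / n) * log(c / n) for c in counts if c > 0)

def weighted_entropy(leaves, log):
    """Weighted average of the leaf entropies, weights = leaf size / total."""
    total = sum(sum(leaf) for leaf in leaves)
    return sum(sum(leaf) / total * entropy(leaf, log) for leaf in leaves)

split_a = [[8, 1], [2, 4]]  # split from the thread
split_b = [[5, 3], [5, 2]]  # hypothetical alternative, for comparison only

a_ln, b_ln = weighted_entropy(split_a, math.log), weighted_entropy(split_b, math.log)
a_2, b_2 = weighted_entropy(split_a, math.log2), weighted_entropy(split_b, math.log2)

# Natural-log entropy equals base-2 entropy times ln(2), for every split.
ratio_a = a_ln / a_2
ratio_b = b_ln / b_2
print(ratio_a, ratio_b, math.log(2))

# Because the factor is the same positive constant, the better split stays the better split.
same_ranking = (a_ln < b_ln) == (a_2 < b_2)
print(same_ranking)
```

Since every candidate split is scaled by the same positive constant, the comparison between splits does not change with the base.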

What is this meant to explain?

Also, I think we can’t determine the importance of a node by the number of splits down it. Imagine I have a node under which there is only one split, but both left and right of the split is 100% pure and they contain 95% of all data. This node with only one split is VERY important. Agree?

OK, then tell me this: why do we need to give more importance to the left node than to the right node (based on your T/F example)? This might help clear up my doubts.

Left leaf is 9/15. It is 9 because it has 9 samples in total (8T + 1F). It is 15 because the total number of samples involved in this split is 15.

Right leaf is 6/15. It is 6 because it has 6 samples in total (2T + 4F). It is 15 because the total number of samples involved in this split is 15.

It is just a very boring matter of counting.