Non-binary decision tree

Hi,

In week 4 of this course we went through decision trees for deciding whether an animal is a cat or a dog, which is a binary problem. We also explored regression trees for numerical values. However, I haven’t heard anything about multiclass classification. For example, what if I wanted to build a tree to predict the breed of a dog? Is that something we would do?

Thanks

Idea generator: [link to the scikit-learn docs listing multiclass, multilabel, and multi-output classifiers]

Thanks for the link but I’m more interested in the theory than in the implementation. Isn’t it covered by deeplearning.ai classes?

Hi @soufg

I can try to complement @ai_curious’s answer:

Yes, of course.

It’s practically impossible to cover every machine learning approach to every potential machine learning problem. The same binary (cat vs. dog) problem easily extends to a multiclass problem: instead of having leaf nodes predict 0 or 1, a leaf node can predict class 0, 1, 2, …, n. (I might be mistaken, but I think this was mentioned in the course?)

What do you mean in particular? Generally, in classification problems, correctly classifying an instance gets harder as the number of classes increases, because the required number of data points grows with every class (without generalization capabilities, exponentially), and there are also labeling challenges (obtaining the labels is a problem, and mislabeled data points are another).

Cheers

Thank you.

What I mean is that in the class we covered how to predict a binary class using the information gain formula with the left/right parameters.

We also covered predicting numerical values using variance reduction.

I’m curious to know whether there is an established formula like the “information gain using left/right node weights and entropy”, but for more than two nodes. Would that work well if I replaced

w_left, p_left, w_right, p_right (as per the week 4 course)

with, say, for 3 classes (“is a dog”, “is a cat”, “is something else”):

w_1, p_1, w_2, p_2, w_3, p_3.

That way I get 3 weighted entropy values, from which I take the average.
But maybe other problems arise in this situation and neural networks are just better for that?
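
To make it concrete, here is a rough sketch (entirely my own, not from the course) of the calculation I’m imagining, where each of the 3 branches gets a weight w_i (its share of the examples) and a vector p_i of class proportions:

```python
import numpy as np

def entropy(p):
    """Entropy of one branch, given its class proportions p (summing to 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return -(p * np.log2(p)).sum()

# 3 branches: "is a dog", "is a cat", "is something else" (made-up numbers).
w = [0.5, 0.3, 0.2]                   # w_i: fraction of examples in branch i
p = [[0.8, 0.1, 0.1],                 # p_i: class proportions inside branch i
     [0.1, 0.8, 0.1],
     [0.2, 0.2, 0.6]]

# Weighted average of the 3 branch entropies.
split_entropy = sum(w_i * entropy(p_i) for w_i, p_i in zip(w, p))
print(split_entropy)
```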

Yes, I understand that. With that said, is there a good place to learn more about all these things? I’m really new to this field, so I have no idea where to find good documentation, and I’m very curious to know more.

Thanks!

The objective of providing that link was to quickly show the 50-some different algorithms that support one form or another of multi-class, multi-label, or multi-output classification (and their combinations). Some of the basics are covered in MLS, though in my experience all of these classes focus more on implementation, or at least application, than on theory. If that is really your focus, you could take that list and find the papers that introduced and explained each of those methods. None of these class videos or programming exercises will go that deep. Cheers.

I guess this example would clarify things.

In general, information gain lets you choose a split on which the classes get “purer” (and there do not have to be only 2 classes). If one particular split gives you the maximum information gain, then you use that split.

As in @ai_curious’s link, for example, DecisionTreeClassifier lets you choose the criterion “entropy”, which is what is used in the course.
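
For instance, here is a minimal sketch of that (my own quick example; the iris data is just a stand-in for any 3-class dataset):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Iris has 3 classes; any multiclass dataset would work the same way.
X, y = load_iris(return_X_y=True)

# criterion="entropy" makes the splits use information gain,
# i.e. the same idea as in the course, just with more than 2 classes.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X, y)

print(clf.predict(X[:5]))  # predicted class labels (0, 1 or 2)
```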

You started in a good place (the MLS course), and from here you have more questions, which is great :). But to state the obvious, it depends on your interests and goals, because those determine which paths lead to the answers. For example, if you are more application-oriented, the sklearn library docs that @ai_curious linked are a good place to dig in. On the other hand, if you are more interested in research, then published papers are the more obvious place to go.
In simple words, Wikipedia can show you the branches, but getting deep knowledge requires a lot of effort, and sometimes the internet is not enough (even though you can find almost anything you search for).

Do you understand I’m an absolute beginner in this field? Sending a link without explaining what you explained in this message was not that helpful.

Thanks

Thank you. Indeed, at the moment I’m more curious about how things work and the stats behind them than about applications.

I will dig around online to see if I can find answers to my questions. I already found a few things through Wikipedia’s references that I hadn’t thought of in the first place.

Entropy values H(p_1), H(p_2), H(p_3), etc. will all come out to the same value for a given node. The entropy formula is simplified in the slide above to make it easy to follow the calculation in the slide below.

[image: course slides showing the simplified entropy formula and the example calculation]

Entropy is more a property of the node, so for a given node it will be the same. If the nodes are denoted [C_1, C_2, C_3, …, C_J] and there are k classes in the dataset, then the entropy for a given node C_j should be

H(C_j) = -p_0 log_k(p_0) - p_1 log_k(p_1) - … - p_{k-1} log_k(p_{k-1})

where p_0 is the fraction of class-0 examples at the node whose entropy we are calculating, p_1 is the fraction of class-1 examples at that node, and so on.

The information gain formula should be

IG = H(C_{parent}) - (w^{left} H(C_{left}) + w^{right} H(C_{right}))
Here w^{left/right} = (number of elements in the left/right node) / (number of elements in the parent node).
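
If it helps, here is a small numpy sketch of those two formulas (my own code, not from the course; I use log base k so that the maximum possible entropy is 1, as above):

```python
import numpy as np

def entropy(counts, k):
    """Entropy of a node from its per-class counts, using log base k (k = number of classes)."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]                             # convention: 0 * log(0) = 0
    return -(p * (np.log(p) / np.log(k))).sum()

def information_gain(parent_counts, left_counts, right_counts, k):
    """IG = H(parent) - (w_left * H(left) + w_right * H(right))."""
    n_parent = sum(parent_counts)
    w_left = sum(left_counts) / n_parent     # fraction of elements going left
    w_right = sum(right_counts) / n_parent   # fraction of elements going right
    return entropy(parent_counts, k) - (w_left * entropy(left_counts, k)
                                        + w_right * entropy(right_counts, k))
```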

For example, let’s consider the dataset [cat1, cat2, cat3, dog1, dog2, dog3, dog4, mouse1, mouse2].
There are 3 cats, 4 dogs, and 2 mice. Let’s consider a decision tree as below:

[image: a decision tree split sending 2 cats, 1 dog, and 2 mice to the left node and 1 cat and 3 dogs to the right node]

The entropy of the C_{parent} node is

H(C_{parent}) = -p_{cat}*log_3(p_{cat}) -p_{dog}*log_3(p_{dog}) -p_{mouse}*log_3(p_{mouse})
here p_{cat} = 3/9, p_{dog} = 4/9, and p_{mouse} = 2/9

H(C_{parent}) = -(3/9 * log_3(3/9)) - (4/9 * log_3(4/9)) - (2/9 * log_3(2/9)) = 0.966

The above entropy would be exactly 1 if all the classes were equally represented, e.g., 3 cats, 3 dogs, and 3 mice. Similarly, you calculate H(C_{left}) and H(C_{right}):

H(C_{left}) = -(2/5 * log_3(2/5)) - (1/5 * log_3(1/5)) - (2/5 * log_3(2/5)) = 0.96
H(C_{right}) = -(1/4 * log_3(1/4)) - (3/4 * log_3(3/4)) - (0/4 * log_3(0/4)) = 0.51
(The right node has only 4 elements, so its fractions are out of 4, and the 0 * log_3(0) term is taken to be 0 by convention.)

Finally, the information gain is calculated with
w^{left} = 5/9, w^{right} = 4/9
IG = H(C_{parent}) - (w^{left} H(C_{left}) + w^{right} H(C_{right}))
IG = 0.966 - (5/9 * 0.96 + 4/9 * 0.51) ≈ 0.205

Without knowing how other features would perform, we can’t say whether the above IG is good or bad, but the right node seems to do a good job of isolating the dogs at least. We also have to consider, though, that most of the elements end up in the left node, and its entropy is very high, which is bad.
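
Using the two small functions from the sketch above, the whole example can be double-checked in a few lines (reading the counts off the fractions: parent = 3 cats / 4 dogs / 2 mice, left = 2/1/2, right = 1/3/0):

```python
parent, left, right = [3, 4, 2], [2, 1, 2], [1, 3, 0]        # cats, dogs, mice

print(round(entropy(parent, k=3), 3))                        # ~0.966
print(round(entropy(left, k=3), 3),
      round(entropy(right, k=3), 3))                         # ~0.96 and ~0.512
print(round(information_gain(parent, left, right, k=3), 3))  # ~0.205
```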