Is it possible to use Decision Trees for multi-class classification?
Which is more convenient to use for binary classification: Decision Trees, or a Neural Network model with ReLU and Sigmoid functions? What about linear regression?
Yes, you can use decision trees for multi-class classification problems. The choice of model depends on the dataset and the goals; a neural network provides a more complex model. I usually start with a decision tree and build all the way up to XGBoost, evaluating the performance at each step, especially if I am working with tabular data. With any form of data that is not tabular I go down the neural network path. It depends on your problem and goals.
Regression models are used for continuous outcomes, so using them for multi-class classification problems is not the best choice.
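To see the multi-class part concretely, here is a minimal sketch (my choice of library and dataset, not from the course) using scikit-learn on the 3-class Iris dataset:

```python
# Minimal sketch: a single decision tree on a 3-class problem.
# Assumes scikit-learn is installed; Iris has 3 classes out of the box.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)          # multi-class targets need no special handling
print(clf.predict(X_test[:5]))     # predicted class labels, e.g. [0 2 1 ...]
print(clf.score(X_test, y_test))   # mean accuracy on the test set
```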
All of them, decision-tree-based (DT) models, neural networks (NN), and logistic regression (LogR, instead of linear regression), are convenient to use because there are libraries available for quick use (see the sketch after this list).
An NN requires you to design an architecture, which may be slightly less convenient.
NN and DT have more hyperparameters to tune than LogR, so they may be less convenient for beginners.
For tabular data, it is easier to train a better model with DT, so it may be more convenient.
NN and DT enable non-linearity, so they are more convenient for modeling non-linear relations.
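To make the first point concrete, here is a sketch (library, dataset and settings are my own choices) fitting all three model families with essentially the same two lines of code each:

```python
# Sketch: LogR, DT and a small NN, each fit with the same pattern.
# The point is only how little the code differs between model families.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "LogR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(max_depth=5, random_state=0),
    "NN": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, model.score(X, y))  # training accuracy, just for illustration
```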
By "convenient" I mean: which is the first choice you should take when you face a classification problem, DT or NN?
I mean, based on what I've learnt: with the sigmoid function, the loss function, the weights, and all of them applied in an NN, it is clear how the algorithm adjusts the variables to make a prediction.
Then in a DT you use a formula similar to a loss function, but with entropy based on the frequency of a variable in the source (for instance, the fraction of cats in the group represented at a node).
I've used entropy in a data compression algorithm to determine the minimum number of bits required to send a source over a binary channel, and then with Huffman coding you create a tree to encode based on the frequency of each character.
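For reference, a quick sketch (mine, not from any course material) of that compression use of entropy: the bits-per-symbol lower bound that a Huffman code approaches:

```python
# Sketch: Shannon entropy of a character source, i.e. the minimum average
# bits/symbol needed to encode it (what Huffman coding approaches).
from collections import Counter
from math import log2

def source_entropy(text: str) -> float:
    counts = Counter(text)
    n = len(text)
    # H = -sum(p * log2(p)) over the symbol frequencies p
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(source_entropy("aaaabbc"))  # ~1.38 bits/symbol for this tiny source
```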
But here the entropy is used differently, and I am not seeing it as clearly as the sigmoid, loss function, and gradient descent.
If it is a tabular dataset, I may first build a gradient-boosted decision tree as a baseline. If it is images, all state-of-the-art (SOTA) models are neural networks.
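If it helps, a minimal version of such a baseline could look like this (HistGradientBoostingClassifier is scikit-learn's GBDT; `xgboost.XGBClassifier` would be a drop-in alternative; the dataset is my stand-in for "tabular"):

```python
# Sketch: a quick GBDT baseline for a tabular dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
gbdt = HistGradientBoostingClassifier(random_state=0)
print(cross_val_score(gbdt, X, y, cv=5).mean())  # the baseline accuracy to beat
```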
In the scope of this course, a decision tree finds the best split by an exhaustive search. The best split is the one that gives the most information gain, and the gain is defined as the change in entropy (which is not the only possible definition).
Good split → well distinguished → low entropy
Bad split → mixed → high entropy
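The arrows in code form, assuming the course's binary cat / not-cat setting:

```python
# Sketch: entropy and information gain for a binary (cat / not-cat) node.
from math import log2

def entropy(p1: float) -> float:
    """Entropy of a node where a fraction p1 of the examples are cats."""
    if p1 in (0.0, 1.0):
        return 0.0  # a pure node is perfectly distinguished
    return -p1 * log2(p1) - (1 - p1) * log2(1 - p1)

def information_gain(p_root, p_left, p_right, w_left):
    """Reduction in entropy from splitting a node into two branches.
    w_left is the fraction of examples that go to the left branch."""
    w_right = 1 - w_left
    return entropy(p_root) - (w_left * entropy(p_left) + w_right * entropy(p_right))

# Good split: branches nearly pure -> low entropy -> high gain
print(information_gain(0.5, 0.9, 0.1, 0.5))  # ~0.53
# Bad split: branches stay mixed -> high entropy -> low gain
print(information_gain(0.5, 0.6, 0.4, 0.5))  # ~0.03
```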
Any relations indicated by the arrows that are not clear to you?
Cheers,
Raymond
PS1: LogR is my preferred abbreviation for Logistic Regression, to distinguish it from LinR (Linear Regression).
PS2: The scope for DT in this course does not include gradients, sigmoid, or loss, but just information gain. Those concepts are heavily involved in gradient-boosted decision trees (GBDT), which, however, are outside of this course. In this course, we do have the entropy formula, which looks like the binary cross-entropy loss, but we are not discussing that entropy formula as a loss function. In my opinion, talking about loss + gradient because you want to learn GBDT is a good thing, but talking about loss + gradient because we confuse it with the entropy formula is a bad thing. Going further into GBDT requires more reading because it is quite a different algorithm; it is not a one-line-explanation thing. I put this as a PS because I think we are still discussing DT within the scope of this course, aren't we?
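For reference, here are the two formulas side by side, roughly in the course's notation ($p_1$ is the fraction of positive examples at a node; $f$ is the model's predicted probability for label $y$):

$$H(p_1) = -p_1 \log_2(p_1) - (1 - p_1)\log_2(1 - p_1)$$

$$L(f, y) = -y \log(f) - (1 - y) \log(1 - f)$$

They share the same algebraic shape, but $H$ takes a class fraction and measures how mixed a node is, while $L$ takes a prediction-label pair and is what gradient descent minimizes.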
Since I may not be able to check this place often starting from tomorrow, in case you are actually interested in the details of GBDT and would like some extra reading, you may start with: -
It discusses relevant concepts, and has links to some videos, books and papers.
This one is a bit more technical, with maths, but should give you a more fundamental view. The so-called Gradient-boosted Decision Tree is actually just an approach that combines the more general idea of a "Gradient boosting machine" with a "Decision Tree" as the machine's base model. In this one, you will read about the gradient and the loss function.
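As a taste of what those readings cover, here is a toy sketch of that combination (my simplification, using squared-error loss so that the negative gradient is simply the residual):

```python
# Toy sketch of gradient boosting with trees as the base model.
# With squared-error loss, the negative gradient of the loss w.r.t. the
# current prediction is just the residual (y - prediction), so each new
# tree is fit to the residuals of the ensemble built so far.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

prediction = np.zeros_like(y)   # start from a zero model
learning_rate = 0.1
trees = []
for _ in range(100):
    residual = y - prediction                      # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * tree.predict(X)  # boosting step
    trees.append(tree)

print(np.mean((y - prediction) ** 2))  # training MSE shrinks as trees are added
```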
These two links should contain (or link to) quite a lot of useful material from which you can identify further keywords to search for more reading. If there is a library near you, I would actually recommend seeing if you can find a textbook. I believe that a helpful book has to be an accessible one.
This topic would definitely have been worth an extra week if it were covered in the course, so be ready for it to take quite some time and effort.
Thanks @rmwkwok, I realize I've only scratched the surface of DT.
As a side note regarding the entropy formula: I've noticed it is similar to the loss function in logistic regression, in how it is composed.