Why apply one-hot encoding

I have two questions regarding one-hot encoding, please help me understand them:
1/ One-hot encoding is used for categorical features in, for example, a regression model or a neural network, so that their numerical values carry the right meaning. For example, if the feature "cat's ear" has 3 shapes and we number them 1, 2, 3, the model will infer an inaccurate meaning (something like: ear type 2 contributes to the price of the cat twice as much as type 1, which it does not). Do I understand this correctly?
2/ In a decision tree, why do we need to apply one-hot encoding instead of just splitting the node into the same number of branches as there are categories? Is it because there is a problem with calculating the entropy, or something else?
Thank you!

Hi @francesco4203

Yes, we can speak about the contribution of a feature in a given model via its weight, but I must emphasize that any such statement is bound to that model; it is not a general truth.

The problem is the same as in a linear regression model - the arbitrary ranking - 3 different shapes can result in 6 different rankings. Obviously, a different ranking can result in a different decision tree model with different performance. How can we get rid of this uncertainty? One-hot encoding.

So you are saying that if there are 10 categorical values, then it splits the feature into 10 sub-branches?

What do you think?


So it’s something like this, for example: there are 10 shapes of animal ears, and most cats have ears of shape 1, so deciding whether an animal is a cat or not should be based on whether its ear is of shape 1 or not, and the remaining 9 shapes are not that important for the decision. Is that correct?

How did you come to that? I don’t quite follow it. Anything in my last reply that leads you there?

You were talking about the ranking, so I understood that different categories of the same feature might contribute differently to the decision - like splitting on one category might result in lower entropy than on the others.

I talked about the ranking, and said that, with 3 categorical values, there can be 6 rankings. Let’s say we have square, round, and triangular shapes and we label them as S, R, and T respectively.

The 6 rankings are:

  1. S R T
  2. S T R
  3. R S T
  4. R T S
  5. T R S
  6. T S R
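The count of rankings above is just 3! = 6, which can be checked programmatically (this snippet is only an illustration):

```python
from itertools import permutations

# All possible orderings (rankings) of the three shape labels.
rankings = list(permutations(["S", "R", "T"]))
print(len(rankings))  # 3! = 6
```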

In the first ranking, we are implying that S < R < T because S is 0, R is 1, and T is 2.

This (S < R < T) has implications for the linear regression model and for the decision tree’s splitting algorithm.
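For instance, with a single weight on an integer-encoded feature, a linear model is forced into a fixed ratio between categories (a hypothetical sketch; the weight value is made up):

```python
# Hypothetical sketch: one weight w on an integer-encoded shape feature.
# The labels 0, 1, 2 correspond to the ranking S < R < T above.
w = 0.5  # made-up weight value

contribution_S = w * 0  # S is labeled 0
contribution_R = w * 1  # R is labeled 1
contribution_T = w * 2  # T is labeled 2

# Whatever w is, T always contributes exactly twice as much as R -
# an artifact of the arbitrary labeling, not of the data.
```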

We can ask ourselves: how do we pick one out of the above 6 possibilities? Are we aware that we are actually making a pick at all?

Different rankings have different implications.

To get rid of these differences, we one-hot-encode.
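A minimal pure-Python sketch of why one-hot encoding works here (the `one_hot` helper is made up for illustration): every pair of one-hot vectors is equally far apart, so no ordering between shapes is implied.

```python
shapes = ["S", "R", "T"]

def one_hot(shape):
    """Return the one-hot vector for a shape, e.g. 'R' -> [0, 1, 0]."""
    return [1 if s == shape else 0 for s in shapes]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Integer labels 0, 1, 2 put S twice as far from T as from R;
# one-hot vectors make all pairwise distances identical.
d_SR = squared_distance(one_hot("S"), one_hot("R"))
d_ST = squared_distance(one_hot("S"), one_hot("T"))
print(d_SR == d_ST)  # True: no implied ranking
```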

Above is all I wanted to say. As for whether a shape is important or not, the optimization of our linear model/decision tree will decide :wink:


Oh I understand that now. Thank you!
But still, in the decision tree, if it splits the feature into 10 sub-branches, what is the problem? Doesn’t it eventually split into 10 sub-branches in total anyway, if we sequentially split on the 10 new one-hot encoded features?
Then I guess the point is that it might not split on all 10 features, but might choose only some of them during the process. Is that correct?

Yes. Splitting into 10 may not be optimal. Decision trees only split; they do not re-group. If we only split one node into two each time, then we have the freedom to end up with 2, 3, 4, or up to 10 branches - whichever is optimal.
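This can be sketched with a toy information-gain computation (pure Python; the data and helper names are made up, not from any library). With one-hot features, the single binary split "shape 1 vs the rest" can already be the best one, so the tree never needs the other 9 branches:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Toy data: 10 possible ear shapes; all cats (label 1) have shape 0.
samples = [(0, 1), (0, 1), (0, 1), (3, 0), (7, 0), (9, 0)]  # (shape, is_cat)

def split_gain(samples, shape):
    """Information gain of the binary split 'shape == k' vs 'shape != k'."""
    labels = [y for _, y in samples]
    left = [y for s, y in samples if s == shape]
    right = [y for s, y in samples if s != shape]
    n = len(labels)
    return entropy(labels) - (len(left) / n * entropy(left)
                              + len(right) / n * entropy(right))

best = max(range(10), key=lambda k: split_gain(samples, k))
print(best)  # 0: one binary split on 'shape == 0' separates cats perfectly
```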

I completely understand now.
I really appreciate your help. Have a nice day :wink:

You are welcome! Since your questions are cleared up, you may want to read about Optimal Partitioning and Target Encoding. They allow you to avoid one-hot encoding a categorical feature, but be careful: they have their own pros and cons.
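As a rough sketch of target encoding (pure Python; the data and variable names are made up): each category is replaced by the mean target value observed for it, giving one numeric column instead of one column per category.

```python
# Made-up data: (ear shape, price of the cat).
rows = [("S", 10.0), ("S", 12.0), ("R", 20.0), ("T", 5.0), ("T", 7.0)]

# Accumulate per-category sums and counts of the target.
sums, counts = {}, {}
for shape, price in rows:
    sums[shape] = sums.get(shape, 0.0) + price
    counts[shape] = counts.get(shape, 0) + 1

# Replace each category with its mean target value.
target_encoding = {s: sums[s] / counts[s] for s in sums}
print(target_encoding)  # {'S': 11.0, 'R': 20.0, 'T': 6.0}
# One numeric column instead of one column per category, but note the
# downside: it uses the target, so it is prone to leakage/overfitting.
```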
