Why is one-hot encoding needed?

In the video, Andrew describes that a node can have three branches, so I'm not sure why I would transform my data with one-hot encoding and increase its size, especially when a decision tree algorithm can work on string data.

Decision trees are designed to make true/false decisions.
You cannot do that if the target variable is an enumerated set of integers.
So you need a logical output for each label.

Thanks for replying, @TMosh.
But I do have some follow-ups; if you can help resolve them, that would be great.

  1. In my case, the target variable is binary, so that shouldn’t be a concern.

My question is about one of the input features which has 3 distinct classes (e.g., “Red”, “Green”, “Blue”).

Since decision trees can create multi-way splits (e.g., one node branching into 3 based on the feature value), why is one-hot encoding necessary? Doesn't that just increase dimensionality unnecessarily? (A sketch of the encoding I'm referring to is at the end of this post.)

  2. Also, for a multi-class target variable, can't the decision tree simply split as:
Root → Is class A?
   ├── Yes → Class A
   └── No → Is class B?
           ├── Yes → Class B
           └── No → Class C

Wouldn’t that still work without requiring one-hot encoding?

Without one-hot encoding, the model may unintentionally learn an implied ordinal relationship among the values of that feature.

For a simple example, suppose you have a "lifeform" feature whose candidate values are 1 = amoeba, 2 = elephant, 3 = bacteria.

The sequence of values would imply that an amoeba is more closely related to an elephant than to a bacterium.
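
To make that concrete, here is a minimal sketch with made-up labels (scikit-learn is used here only as an example implementation, not necessarily what the assignment uses):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

y = np.array([1, 0, 1, 1, 0, 1])  # made-up binary target

# Integer-coded feature: 1 = amoeba, 2 = elephant, 3 = bacteria
X_int = np.array([[1], [2], [3], [1], [2], [3]])
tree_int = DecisionTreeClassifier(random_state=0).fit(X_int, y)
print(export_text(tree_int, feature_names=["lifeform"]))
# Splits are threshold tests (e.g. "lifeform <= 1.5", "lifeform <= 2.5"),
# so isolating "elephant" takes two splits, purely because 2 happens to
# sit between 1 and 3 in an ordering that means nothing biologically.

# One-hot encoding: one yes/no column per category, no implied ordering
X_hot = np.array([
    [1, 0, 0],  # amoeba
    [0, 1, 0],  # elephant
    [0, 0, 1],  # bacteria
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
])
tree_hot = DecisionTreeClassifier(random_state=0).fit(X_hot, y)
print(export_text(tree_hot, feature_names=["amoeba", "elephant", "bacteria"]))
# A single split on the "elephant" column is enough to separate the classes.
```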