What advantage do we gain by splitting the single 3-category feature, Ear Shape, into Pointy ears, Floppy ears, and Oval ears using one-hot encoding, as opposed to having three branches in the decision tree when the node is split on Ear Shape, with each branch assigned a distinct category value?
One advantage I can see is that the decision tree built with one-hot encoding will be simpler to understand and analyze, with a maximum of 2 branches per node. Is there any other advantage?
You can’t have text in the rows when fitting a model.
One-hot encoding converts each categorical value into a numeric “dummy variable”.
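For instance, here is a minimal sketch with pandas (the column name and values are made up to mirror the lecture example):

```python
import pandas as pd

# Toy data made up for illustration; "EarShape" mirrors the ear-shape
# feature from the lectures.
df = pd.DataFrame({"EarShape": ["Pointy", "Floppy", "Oval", "Pointy"]})

# One-hot encode: each category becomes its own dummy column
# (EarShape_Floppy, EarShape_Oval, EarShape_Pointy), so the model
# only ever sees numbers/booleans, never text.
one_hot = pd.get_dummies(df, columns=["EarShape"])
print(one_hot)
```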
Hi @abhilash341,
Besides @dan_herman’s answer, what other problems can you think of without one-hot encoding them?
Let’s say we use numbers to represent ear shape: 1, 2, 3 stand for pointy, oval, and floppy respectively. Given that you know how splitting works (the lecture explained that part), what are the possible splits?
What if you change the order of those shapes, so that now 2, 1, 3 represent pointy, oval, and floppy respectively?
Cheers,
Raymond
This is my intuition.
Hello @abhilash341,
This is not how most decision tree packages work; they do not give you a three-way split, only a two-way split. For example, a tree can split between 1 and 2, so that you have a group of (1) and a group of (2, 3), or it can split between 2 and 3, so that you have a group of (1, 2) and a group of (3).
So, do you see any problem? Is there anything that only one-hot encoded features can give you that the above two-way splits can’t?
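Here is a rough sketch in plain Python (using just the encodings discussed above) that enumerates which groupings a numeric threshold split can and cannot produce:

```python
def threshold_splits(encoding):
    """Return the category groupings reachable by a single threshold split."""
    names = [name for name, _ in sorted(encoding.items(), key=lambda kv: kv[1])]
    # A threshold can only cut the sorted codes into a left block and a right block.
    return [(set(names[:i]), set(names[i:])) for i in range(1, len(names))]

# 1, 2, 3 = pointy, oval, floppy
print(threshold_splits({"pointy": 1, "oval": 2, "floppy": 3}))
# -> ({pointy}, {oval, floppy}) and ({pointy, oval}, {floppy});
#    {oval} can never be isolated by a single split.

# 2, 1, 3 = pointy, oval, floppy (same data, different arbitrary labels)
print(threshold_splits({"pointy": 2, "oval": 1, "floppy": 3}))
# -> ({oval}, {pointy, floppy}) and ({oval, pointy}, {floppy});
#    a different set of reachable splits, purely because of the labels.
```

With one-hot encoding there is one 0/1 column per shape, so the tree can always isolate any single category in one split, no matter how you ordered the labels.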
Raymond
Generally you should not use an enumerated list to identify different classes. This implies an unnecessary (and probably incorrect) linear relationship between the classes.
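As a rough sketch with scikit-learn (the toy data, and the assumption that only oval ears indicate a cat, are made up for illustration), the integer encoding forces the tree to spend two threshold splits to isolate the middle category, while a one-hot column does it in one:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Integer codes 1/2/3 = pointy/oval/floppy; suppose only "oval" (code 2) means cat.
X_int = np.array([[1], [2], [3], [1], [2], [3]])
y = np.array([0, 1, 0, 0, 1, 0])
tree_int = DecisionTreeClassifier().fit(X_int, y)
print(export_text(tree_int, feature_names=["ear_shape_code"]))  # needs depth 2

# One-hot column "is_oval": a single split separates the classes.
X_oval = np.array([[0], [1], [0], [0], [1], [0]])
tree_oh = DecisionTreeClassifier().fit(X_oval, y)
print(export_text(tree_oh, feature_names=["is_oval"]))  # depth 1
```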