Understanding One Hot Coding

Hi there,


When there are 3 different ear shapes, three columns are created using one-hot encoding.


But when there are two different face shapes, only one column is used. Why?

Thanks in advance !!!


Hi @Praveen_Titus_F ,

When you only have 2 classes, like 2 different faces, you have basically a binary situation.

In the case of the 2 faces, say Happy face and Sad face:

When the column has ‘1’ for the Happy face, what happens with the Sad face? If Happy face is 1, then Sad face must be 0, right?

Conversely, when the column has ‘0’ for the Happy face, what happens with the Sad face? If Happy face is 0, then Sad face must be 1, right?

That’s why when you only have 2 classes, you can ‘optimize’ your One-Hot Encoding a bit, and instead of using 2 columns, you can use only one.
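To make the binary case concrete, here is a minimal sketch in plain Python. The labels and variable names are made up for illustration. It builds the two-column encoding, keeps only the "happy" column, and then rebuilds the dropped "sad" column as 1 minus "happy":

```python
# Hypothetical face labels; the names are illustrative only.
faces = ["happy", "sad", "sad", "happy"]

# Two-column one-hot encoding: one column per class, as (happy, sad).
two_columns = [(int(f == "happy"), int(f == "sad")) for f in faces]

# One-column encoding: keep only the "happy" column.
one_column = [happy for happy, _sad in two_columns]

# The dropped "sad" column can always be rebuilt as 1 - happy.
rebuilt_sad = [1 - happy for happy in one_column]

print(two_columns)
print(one_column)
print(rebuilt_sad)
```

Since the second column can be recovered exactly from the first, dropping it loses no information.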

And this can be generalized for the case of more classes. I invite you to try to solve a more efficient One-Hot encoding for 4 classes. Can you do it with 3 columns?

Juan

Hi @Juan_Olano ,
So if I have 4 categories (red, blue, green, yellow) in a variable, then in total there will be four dummy columns, one for each color.
Am I correct?

Yes, you can certainly use 4 columns. But my challenge to you is: Can you do it with 3 columns?

For 4 categories in a variable, that would be 4 - 1 = 3 dummy columns.
So if there are k categories, then k - 1 dummy columns can be used, but this seems confusing. That’s the problem.

Ok, let me explain. In your example of 4 colors, red, blue, green, and yellow:

Let’s say I use 3 columns: one for red, one for blue, and one for green. What happens if all 3 columns are zero?

If all 3 columns (red, blue, green) are zero, then what’s left? Yellow, right?
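Here is a small sketch of that mental exercise in plain Python. The helper names `encode` and `decode` are hypothetical, and the choice of yellow as the dropped category simply follows the example above:

```python
# Only three columns; yellow is the category without a column of its own.
COLUMNS = ["red", "blue", "green"]

def encode(color):
    """Return the 3-column encoding of one of the 4 colors."""
    return [int(color == c) for c in COLUMNS]

def decode(row):
    """Recover the color; an all-zero row can only mean yellow."""
    if sum(row) == 0:
        return "yellow"
    return COLUMNS[row.index(1)]

for color in ["red", "blue", "green", "yellow"]:
    print(color, encode(color), decode(encode(color)))
```

Every color survives the round trip, so 3 columns are enough to tell the 4 colors apart when the columns are read together.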

I am not telling you to always use k-1 columns for k classes when one-hot encoding. I am just suggesting a mental exercise to understand your original question.

What do you think?

Does the algorithm automatically assume that if all 3 are 0, then the left-out one is 1?

Just to make sure: the easiest way to do it is having the same number of columns as number of classes:

2 classes, 2 columns
3 classes, 3 columns
and so on…
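As a rough sketch of the "one column per class" approach, assuming three made-up ear-shape names (the thread does not name them) and a hypothetical helper `one_hot`:

```python
def one_hot(labels):
    """One column per distinct class; exactly one 1 in every row."""
    classes = sorted(set(labels))
    rows = [[int(label == c) for c in classes] for label in labels]
    return classes, rows

# Hypothetical ear shapes, used only for illustration.
ears = ["pointy", "floppy", "oval", "pointy"]
classes, rows = one_hot(ears)

print(classes)
for label, row in zip(ears, rows):
    print(label, row)
```

Each row has exactly one 1, in the column of its own class, which is what makes this form the easiest to read.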

Going back to the original post of FACE SHAPE: Round/Not Round, yes, the algorithm will learn that automatically. It will know if a shape is ROUND or NOT ROUND.


Hi @Juan_Olano
I understand the logic, but how is yellow inferred by our algorithm?

Right. If there are 4 different classes but only 3 columns, then when the ground truth is not red, blue, or green, the encoding falls through to ‘other’, which in this case is ‘yellow’.

And again, for simplicity’s sake and for clarity to any reader of the model, it may be clearer to have as many columns as there are classes being one-hot encoded.

And to reiterate, and sorry if I caused confusion, my intention was to help you expand your intuition on the answer to your original question.

Definitely, now it’s making sense. Thank you @Juan_Olano for helping me understand better.


I am very glad!!! Thanks for following through!!! :)


With two shapes, you can either use two inputs with one-hot encoding, or use a single feature and re-frame the data labels as a true/false condition. For example, it might be "Pointy ears?" with a true/false answer. False would imply the other ear shape.

There’s no difference in the information content, so the results should be the same.


Hello all!

Thank you for this wonderful discussion! I would just like to make a summary of this topic.

In many cases (decision trees in particular, and I would also include neural networks, linear regression, and logistic regression), the one-hot encoded variables aren’t treated as a separate group of variables.

To a human’s eye, if I represent 4 categories with 3 columns, I can tell it’s the 4th category when all 3 columns equal zero. Here, I am considering the 3 columns as a group, and they are related.

To an ML algorithm’s eye, each of those encoded columns is just one of the features, and if an encoded feature (representing color = Red) is equal to zero, then it simply means “Not Red” rather than the 4th color. “Not Red” can mean any of the 3 other colors, not just the 4th color.

The situation of a binary (this or that) category is pretty special, because we don’t care about the difference between “Not this” and “that”. Put another way, we assume “Not this = that”, and that assumption gets rid of the second column.
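A small sketch of this point in plain Python, reusing the 4-color example with 3 columns. The helper `candidates` is hypothetical; it lists which colors are still possible given the values of only some columns:

```python
COLORS = ["red", "blue", "green", "yellow"]
COLUMNS = ["red", "blue", "green"]  # yellow has no column of its own

def candidates(known):
    """Return the colors consistent with the known column values."""
    result = []
    for color in COLORS:
        row = {c: int(color == c) for c in COLUMNS}
        if all(row[c] == v for c, v in known.items()):
            result.append(color)
    return result

# Looking at the red column alone: "Not Red" leaves three possibilities.
print(candidates({"red": 0}))  # ['blue', 'green', 'yellow']

# Only all three columns read together pin down the 4th color.
print(candidates({"red": 0, "blue": 0, "green": 0}))  # ['yellow']
```

A single zero is ambiguous on its own; the 4th color is only identified when all three columns are considered as a group.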

Raymond
