C2W4L2 - One-hot encoding



It says:
“First you will remove the binary variables, because one-hot encoding them would do nothing to them. To achieve this you will just count how many different values there are in each categorical variable and consider only the variables with 3 or more values.”
But “Sex” only has 2 values, and it was still one-hot encoded.
How should I understand this sentence?

Hello @Jinyan_Liu,

In that lab, Sex is a binary variable and doesn’t need to be one-hot encoded.

I have requested an update of the lab. Thanks for reporting.

Cheers,
Raymond

Ah thanks!

The original data in the “Sex” column has 2 values, “F” and “M”. Was it one-hot encoded because the values are neither numbers, nor 0 or 1, nor True or False? Just to transform “F” and “M” into True and False?

For this situation, where a column has only 2 values but they are text, is there a method to transform them into 0s and 1s?

Hello @Jinyan_Liu,

No, one-hot encoding isn’t required just because the values are text. You may transform it to 0 and 1 as long as you believe the feature has only two possible values. To transform a feature’s values into 0 and 1, you might simply replace one value with 0 and the other with 1, or you might use an existing library such as sklearn’s LabelBinarizer or OrdinalEncoder. I am sure you can google many examples of such a transformation.
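
For instance, a minimal sketch with pandas and scikit-learn (the data frame and values here are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data frame; "Sex" mirrors the lab's binary text column.
df = pd.DataFrame({"Sex": ["F", "M", "M", "F"]})

# Option 1: replace each value explicitly.
df["Sex_mapped"] = df["Sex"].map({"F": 0, "M": 1})

# Option 2: let sklearn learn the mapping from the data.
encoder = OrdinalEncoder()
df["Sex_encoded"] = encoder.fit_transform(df[["Sex"]]).ravel().astype(int)

print(df)
```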

I think the above has answered all your questions. Below is a hypothetical case:

Consider that you are building a system and you believe that, even though your feature has only two values in the current dataset, it might take a third value at production time in the future. You might therefore attempt to make it a one-hot encoded variable. In this case, what possible impacts would there be?

  1. Because there are currently just 2 values, the model is trained on two features that are perfectly (negatively) correlated with one another (see the sketch after this list). This may or may not be a problem.

  2. Because there are currently just 2 values, the trained model has never learnt from a case where the feature takes the third value. This may or may not be a problem.
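
To illustrate point 1, a tiny sketch with pandas and NumPy (made-up data) showing the perfect negative correlation:

```python
import numpy as np
import pandas as pd

# A made-up binary column; one-hot encoding it yields two columns
# where one is always exactly 1 minus the other.
sex = pd.Series(["F", "M", "M", "F", "M"])
one_hot = pd.get_dummies(sex, dtype=int)  # columns: F, M

print(np.corrcoef(one_hot["F"], one_hot["M"])[0, 1])  # -1.0
```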

If you want to discuss the above two points, please share your view ;). I also welcome you to bring up any other points. However, it’s completely fine to ignore this if the hypothetical case isn’t what you are focusing on right now :wink: :wink:

Cheers,
Raymond

Thanks for the clarification of one-hot encoding.

I have never thought of this hypothetical case, but it’s very interesting! With one-hot encoding, more values in a column == more features. So in production, there could be more features added. But the “new” features were never considered in training. So will they just be ignored in production?

In a neural network, this x_new is not considered, and in a decision tree, this x_new is not considered either.

You can get some answers by giving them a try! :wink:
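
As one starting point for the experiment, here is a minimal sketch, assuming scikit-learn >= 1.2 (where the keyword is sparse_output), of how one common encoder handles a value it never saw in training:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Training data: the feature takes only two values.
X_train = np.array([["pointy"], ["floppy"], ["pointy"]])

# handle_unknown="ignore" encodes an unseen value as all zeros
# instead of raising an error at production time.
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
encoder.fit(X_train)

# Production data contains a third, never-seen value.
X_prod = np.array([["pointy"], ["oval"]])
print(encoder.transform(X_prod))
# [[0. 1.]    <- columns are (floppy, pointy)
#  [0. 0.]]   <- "oval" maps to all zeros; the model sees no new feature
```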

Cheers,
Raymond

Thanks! I will do so!

Another question: the course only mentions one-hot encoding when talking about decision trees (because a feature will be split according to its values), not when talking about neural networks. But if we use a neural network with a data column “ear shape” that has 3 values, “pointy”, “oval”, and “floppy”, do we also use one-hot encoding to represent the data, or can we use numbers (e.g. 1, 2, 3) to represent it?

Before I respond, I am interested to know why you think there is a chance that we don’t need to one-hot encode it. Would you mind sharing that with me? :slight_smile:

I will wait for 30 minutes and then I will respond even if you haven’t had a chance to reply.

The answer is: either way.

For “one-hot encoding”, I think you already know why and how, because it is the approach introduced in the lecture.
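
For concreteness, a minimal sketch of that approach with pandas (the column name and rows are made up from your example):

```python
import pandas as pd

df = pd.DataFrame({"ear_shape": ["pointy", "oval", "floppy", "pointy"]})

# One-hot encoding: one 0/1 column per distinct value.
print(pd.get_dummies(df["ear_shape"], prefix="ear_shape", dtype=int))
#    ear_shape_floppy  ear_shape_oval  ear_shape_pointy
# 0                 0               0                 1
# 1                 0               1                 0
# 2                 1               0                 0
# 3                 0               0                 1
```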

For “numbers”, it is because some popular decision tree libraries offer to split a categorical feature by a method called “partitioning”. For example, quoting from XGBoost’s doc (source):

Here I only intend to let you know that both ways exist; it does not mean they work in the same way, nor that they will deliver the same result.
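
Purely as a pointer for that reading, here is a minimal, hypothetical sketch of the partitioning route, assuming a recent XGBoost (>= 1.6) where enable_categorical turns on the categorical handling described in that doc (the data and names are made up):

```python
import pandas as pd
from xgboost import XGBClassifier

# Made-up toy data for illustration.
df = pd.DataFrame({
    "ear_shape": ["pointy", "oval", "floppy", "pointy", "floppy", "oval"],
    "is_cat":    [1, 0, 1, 1, 0, 0],
})

# The feature stays as a single categorical column -- no one-hot columns.
X = df[["ear_shape"]].astype("category")
y = df["is_cat"]

model = XGBClassifier(tree_method="hist", enable_categorical=True)
model.fit(X, y)
print(model.predict(X))
```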

However, I will not go any further than that. I will leave the rest for you to discover by reading that documentation page, googling relevant articles and discussions, and, most importantly, experimenting in the future, when you have a dataset of interest, to investigate their differences, pros, and cons.

Lastly, unless you want to explore partitioning, make sure to do one-hot encoding as explained in the lectures.

Cheers,
Raymond

Besides one-hot encoding, I can only think of “numbering” the values of a column.

And speaking of neural networks, I was thinking maybe they accept pure text as input data? It doesn’t have to be numbers?

Thank you! I will explore in different ways!

I didn’t speak of neural networks; my previous reply was only about decision trees.

A neural network only ever accepts numbers. Even a language model needs us to first convert the text into numbers.
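
For example, a minimal sketch (toy data, made-up names) of converting text into numbers before feeding a small scikit-learn neural network:

```python
import pandas as pd
from sklearn.neural_network import MLPClassifier

# Made-up toy data: the text column must become numbers first.
df = pd.DataFrame({
    "ear_shape": ["pointy", "oval", "floppy", "pointy", "oval", "floppy"],
    "is_cat":    [1, 0, 0, 1, 1, 0],
})

# Text -> numbers via one-hot encoding; only then can a network consume it.
X = pd.get_dummies(df["ear_shape"], dtype=int)
y = df["is_cat"]

model = MLPClassifier(hidden_layer_sizes=(4,), max_iter=3000, random_state=0)
model.fit(X, y)
```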

@Jinyan_Liu, I suggest that you:

  • for now, stay with one-hot encoding for categorical variables in every problem,
  • google for more examples on how other people work with categorical variables,
  • work on your own project, try how others handle categorical variables, and compare their approaches with one-hot encoding and your own ideas. This is a good way to confirm or test others’ approaches and your ideas, and to get hands-on experience. Data science is about practical work :wink:, and we had better get used to testing our own ideas through our own practical work.

We can brainstorm many, many ideas, but which ones work, and how well do they work in which situations? I can’t just say yes or no to your ideas unless the answer is very obvious; otherwise, they should be verified case by case. This is because ideas matter.

Cheers,
Raymond

PS: Let’s just focus on decision trees; this is what we have learnt in Course 2 Week 4. Let’s base the discussion on what we have learnt or tried. When you have tried something new and want to share it, open a new topic and share it there.


Agreed! Thank you so much!

There is one more error in the lab regarding one-hot encoding that I would like to report.
(“We have one-hot encoded the features(turned them into 0 or 1 valued features)” in C2W4FinalLab.)
Turning binary values into 0s and 1s is not one-hot encoding, right?

Hello @Jinyan_Liu,

I personally don’t think this is one-hot encoding. However, it is arguable. If you are on the side that it is not one-hot encoding, then I am with you.

Cheers,
Raymond

PS1: I am happy to see your questions, and even though it sounds reasonable to ask another one-hot encoding question in a one-hot encoding thread, I would still like to ask you to open a new thread next time. Here is my reason: I believe a thread starts with some questions/findings, then replies to those questions/findings, then some follow-ups to the replies. Reading the first post should make it easy to get an idea of which aspect of one-hot encoding to expect in the thread.

PS2: I will close this thread in the next 48 hours if there is no further follow-up.

Sure! Thanks!