Does it make sense to call it one-hot encoding when the thing being encoded has only two values?

In the lab it says:

3.1 One hot encoded dataset

For ease of implementation, we have one-hot encoded the features (turned them into 0 or 1 valued features)

But each feature is not turned into a vector of zeros with exactly one element set to 1; it is just turned into the scalar value 0 or 1.

Wikipedia defines it as producing a “group of bits” (and 1 bit isn’t really a group): One-hot - Wikipedia

Hey @toontalk,
That’s an interesting question. Let me explain why we call it one-hot encoding even when it doesn’t appear to be one.

Consider a categorical variable with 2 possible values, “Sour” and “Sweet”, and let’s say we have a dataset of 3 points (representing the dataset with just this feature for now):

X = [[“Sour”], [“Sweet”], [“Sweet”]]

X has shape (3, 1). Now, if we apply one-hot encoding to it, we get:

X = [[1, 0], [0, 1], [0, 1]]
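As a quick sanity check, here is a minimal sketch that reproduces this encoding with Scikit-Learn’s OneHotEncoder (the toy data is just the example above, not the lab’s dataset):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["Sour"], ["Sweet"], ["Sweet"]])  # shape (3, 1)

# sparse_output requires scikit-learn >= 1.2; on older versions use sparse=False
enc = OneHotEncoder(sparse_output=False)
print(enc.fit_transform(X))
# [[1. 0.]
#  [0. 1.]
#  [0. 1.]]
print(enc.categories_)  # [array(['Sour', 'Sweet'], dtype='<U5')]
```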

Now, X has shape (3, 2). Call the first column x1 and the second column x2. I guess we both agree that this satisfies the definition of one-hot encoding. If we take a close look, we will find that for every sample in our dataset, x1 XOR x2 = 1, i.e., if

x1 = 0 → x2 = 1

and if,

x1 = 1 → x2 = 0

So, what need is there for x2 when we can always derive its value from x1? Not much, and hence we can drop it while training our models (do take a close look at the last quote in this post, which describes one possible reason to keep x2).

This encoding scheme is often referred to as Dummy Encoding, and x2 is often referred to as a dummy variable (feel free to refer to x1 as the dummy variable instead; there is absolutely no harm in that).
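In Scikit-Learn terms, this corresponds to the drop parameter of OneHotEncoder; a minimal sketch on the same toy data (again assuming scikit-learn >= 1.2 for sparse_output):

```python
from sklearn.preprocessing import OneHotEncoder

X = [["Sour"], ["Sweet"], ["Sweet"]]

# drop="first" drops the first category ("Sour"), i.e., the x1 column,
# leaving a single 0/1 column for "Sweet" (our x2)
enc = OneHotEncoder(drop="first", sparse_output=False)
print(enc.fit_transform(X))
# [[0.]
#  [1.]
#  [1.]]
```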

Feel free to read more about dummy encoding and dummy variables; the Scikit-Learn docs linked below are a good starting point.

Now arises the question, “So, why don’t we refer to it as Dummy Encoding?”. This is because the general convention is to treat it as a special case of one-hot encoding. You can find this mentioned explicitly in the docs of Scikit-Learn | One Hot Encoder. Let me quote it here as well for your reference:

The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme.

In the description of the drop parameter:

Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into an unregularized linear regression model.
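To see that collinearity concretely: in the full one-hot representation, x1 + x2 = 1 for every sample, so together with an intercept column the design matrix is rank-deficient. A minimal numpy check (the variable names are mine):

```python
import numpy as np

# Full one-hot encoding of ["Sour", "Sweet", "Sweet"]
X = np.array([[1, 0],
              [0, 1],
              [0, 1]])

print(X.sum(axis=1))  # [1 1 1] -> every row sums to 1

# Design matrix [intercept | x1 | x2] has rank 2, not 3:
design = np.hstack([np.ones((3, 1)), X])
print(np.linalg.matrix_rank(design))  # 2
```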

Also, take a close look at this point:

However, dropping one category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance for penalized linear classification or regression models.

I hope this helps.

Regards,
Elemento


Thanks. That all makes sense. Note however that the binary case is a bit special. I see that Scikit-Learn does handle this and even has a special option for it:

  • ‘if_binary’ : drop the first category in each feature with two categories. Features with 1 or more than 2 categories are left intact.
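A minimal sketch of that option, with one binary feature and one three-valued feature (the “taste”/“size” feature names are mine, not from the lab):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["Sour", "Small"],
              ["Sweet", "Medium"],
              ["Sweet", "Large"]])

enc = OneHotEncoder(drop="if_binary", sparse_output=False)
print(enc.fit_transform(X))
# The binary taste feature collapses to one 0/1 column,
# while the 3-category size feature keeps all 3 columns:
# [[0. 0. 0. 1.]
#  [1. 0. 1. 0.]
#  [1. 1. 0. 0.]]
print(enc.get_feature_names_out(["taste", "size"]))
# ['taste_Sweet' 'size_Large' 'size_Medium' 'size_Small']
```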

Note however that the usual description of “one-hot” is that it turns each value into a data structure with exactly one element having the value “one”. If “sour” becomes 1, then “sweet” becomes 0, and there is no “one” in the one-hot value for “sweet”.

My view is now that it isn’t wrong to call the binary case one-hot, but it isn’t helpful either. I’m pretty sure that when Andrew gave examples with two possible values and then replaced them with 0 and 1, he never referred to the process as one-hot encoding.

Hey @toontalk,

This pretty much sums it up. You can refer to it as dummy encoding, or as a special case of one-hot encoding (i.e., one-hot encoding plus removing the redundant variable). Either way, the concept remains the same.

Cheers!