Hey @toontalk,
That’s an interesting question. Let me explain why we still call it one-hot encoding even when the result doesn’t appear to be one-hot.
Consider a categorical variable with 2 possible values, “Sour” and “Sweet”, and say we have a dataset of 3 points (representing just this feature for now):

`X = [["Sour"], ["Sweet"], ["Sweet"]]`

so `X` has shape `(3, 1)`. Now, let’s say we apply one-hot encoding to it; we get:

`X = [[1, 0], [0, 1], [0, 1]]`

and `X` now has shape `(3, 2)`.
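If you want to reproduce this transform, here is a minimal sketch using scikit-learn’s `OneHotEncoder` (the toy dataset and printed outputs are just for illustration):

```python
from sklearn.preprocessing import OneHotEncoder

X = [["Sour"], ["Sweet"], ["Sweet"]]  # shape (3, 1)

# Plain one-hot encoding: one 0/1 column per category
enc = OneHotEncoder()
X_enc = enc.fit_transform(X).toarray()  # .toarray() because the output is sparse by default

print(X_enc)        # [[1. 0.]
                    #  [0. 1.]
                    #  [0. 1.]]
print(X_enc.shape)  # (3, 2)
```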
Consider the first column as `x1` and the second column as `x2`. I guess we both agree that this satisfies the definition of one-hot encoding. Now, if we take a close look, we will find that for every sample in our dataset, `x1 XOR x2 = 1`, i.e., if `x1 = 0` then `x2 = 1`, and if `x1 = 1` then `x2 = 0`. So, what do you think is the need for `x2` when we can always recover its value from `x1`? I guess you won’t find much need for it, and hence we can drop it while training our models (do take a close look at the last point of this post, which describes a possible need for `x2`).
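Just to make this redundancy concrete, here is a quick NumPy check on the encoded matrix from above (a sketch; the array is hard-coded for illustration):

```python
import numpy as np

# The one-hot encoded matrix from above
X = np.array([[1, 0], [0, 1], [0, 1]])
x1, x2 = X[:, 0], X[:, 1]

# Exactly one of x1 and x2 is 1 for every sample, i.e., x1 XOR x2 = 1
print(np.logical_xor(x1, x2).all())  # True

# ...so x2 is fully determined by x1
print(np.array_equal(x2, 1 - x1))    # True
```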
This encoding scheme is often referred to as Dummy Encoding, and `x2` is often referred to as a dummy variable (feel free to refer to `x1` as the dummy variable instead; there is absolutely no harm in that).
Feel free to read more about dummy encoding and dummy variables; there are plenty of great resources on the topic.
Now the question arises, “So, why don’t we refer to it as Dummy Encoding?”. This is because it has been the general convention to include it as a part of one-hot encoding. You can find this mentioned explicitly in the docs of Scikit-Learn | One Hot Encoder. Let me quote it here as well for your reference:
> The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme.
In the description of the `drop` parameter:
> Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into an unregularized linear regression model.
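For completeness, here is a hedged sketch of what that `drop` option does on our toy dataset. Note that `drop="first"` drops the first category per feature (“Sour” here), so it keeps `x2` rather than `x1`, but the idea is the same:

```python
from sklearn.preprocessing import OneHotEncoder

X = [["Sour"], ["Sweet"], ["Sweet"]]

# drop="first" removes the first category's column ("Sour"),
# leaving a single dummy column for this binary feature
enc = OneHotEncoder(drop="first")
print(enc.fit_transform(X).toarray())  # [[0.]
                                       #  [1.]
                                       #  [1.]]
```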
Also, take a close look at this point:
> However, dropping one category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance for penalized linear classification or regression models.
I hope this helps.
Regards,
Elemento