Isn't it a BAD idea to use one-hot encoding for Decision Tree models?

If a variable can take, say, 10 values, e.g. VARA can take the following values: cat1, cat2, …, cat10, then by one-hot encoding VARA we limit any node to splitting on only ONE of the possible values at a time, for example only "is it cat3 or not".

BUT we may be able to SIGNIFICANTLY reduce the entropy even more if we choose a combination of values, say … [cat1 & cat2 & cat3] yes or no, … or … [cat5 & cat10] yes or no, … or even [cat1 & cat2 & … & cat10] yes or no.

Because we reduce to ONE cat value at a time to enter, we significantly reduce the probability of VARA being chosen, particularly if there is a binary variable or a numerical variable to compete against.

My understanding is that we should use Tree-based algorithms that treat Categorical variables WITHOUT one-hot encoding. That is, they actually try combinations of Categorical variable values (up to a point) to identify the best grouping of VARA values to minimize entropy when VARA enters a node.
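To make the entropy argument concrete, here is a minimal sketch (the per-category class counts are made up, purely for illustration) comparing the information gain of a one-vs-rest split, which is all a single one-hot column can express, against a subset split over several categories:

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (in bits) of a class-count vector."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def split_gain(pos, neg, left_cats):
    """Information gain of splitting a categorical variable into
    left_cats vs the rest, given per-category positive/negative counts."""
    pos, neg = np.asarray(pos, float), np.asarray(neg, float)
    total = entropy([pos.sum(), neg.sum()])
    mask = np.zeros(len(pos), dtype=bool)
    mask[list(left_cats)] = True

    def side(m):
        n = pos[m].sum() + neg[m].sum()
        return n, entropy([pos[m].sum(), neg[m].sum()])

    nl, hl = side(mask)
    nr, hr = side(~mask)
    n = nl + nr
    return total - (nl / n) * hl - (nr / n) * hr

# Hypothetical per-category counts of positives/negatives for cat1..cat10.
pos = [90, 85, 10, 12,  8, 80,  9, 11, 88,  7]
neg = [10, 15, 90, 88, 92, 20, 91, 89, 12, 93]

# One-hot style split: "is it cat3?" (index 2) vs everything else.
print("gain of cat3 vs rest:", round(split_gain(pos, neg, [2]), 3))
# Subset split that native categorical handling could find.
print("gain of {cat1,cat2,cat6,cat9} vs rest:",
      round(split_gain(pos, neg, [0, 1, 5, 8]), 3))
```

With counts like these, the subset split reduces entropy far more than any single-category split, which is exactly the kind of split a one-hot encoded tree cannot make in a single node.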

One-hot works well with Linear Regression. Is it recommended for NN?

Hi @Oscar_Rosen,

We need one-hot encoding for a categorical feature when the feature values do not represent ordering and are merely there to distinguish one category from another. Let’s say you have a categorical feature with 5 different values - [0, 1, 2, 3, 4], and ONLY sample 10 has a value of 2 whereas ONLY sample 15 has a value of 4. In this case, even if you swap their feature values so that sample 10 becomes 4 and sample 15 becomes 2, your dataset remains valid, because the samples are still distinguishable from one another by that feature in the same way. Therefore, a categorical feature is not value-sensitive, meaning you can freely assign a different feature value to the group of samples in the same category.

However, decision trees are value-sensitive, and they can behave differently before and after the swap, because a tree splits at a split point and groups the samples with a value lower than the split point on one side and the rest on the other.

To get rid of this difference, we one-hot encode such categorical features, so that the performance of the trees no longer depends on which values were assigned to the categories. The same holds for linear regression and NNs.

If you do not one-hot encode your features, it is true that you could end up splitting a categorical feature into a group of [1, 2, 3] and a group of [4, 5], and you may or may not get a better modeling result. Decision trees are value-sensitive, but categorical feature values are not. In order to achieve your desired performance boost, you need a way to set the values instead of randomly assigning them, and it can take a lot of time to find a good assignment by trial and error.

Lastly, if your categorical feature is actually an ordinal feature, meaning that its values represent an ordering, you can skip the one-hot encoding, use it like a numerical feature, and likely get better performance than with the one-hot encoded version.
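As a small illustration of both cases, here is a sketch with scikit-learn (the column names and data are made up; `sparse_output` requires scikit-learn 1.2+, older versions use `sparse=False`):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical data: "color" is nominal (no ordering), "size" is ordinal.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size":  ["small", "large", "medium", "small"],
})

# Nominal feature: one-hot encode so the model does not read a spurious
# ordering into arbitrary integer codes.
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
color_onehot = ohe.fit_transform(df[["color"]])
print(ohe.get_feature_names_out())   # ['color_blue' 'color_green' 'color_red']
print(color_onehot)

# Ordinal feature: keep a single column whose values respect the true order.
ord_enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_ordinal = ord_enc.fit_transform(df[["size"]])
print(size_ordinal.ravel())          # [0. 2. 1. 0.]
```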

Cheers,
Raymond

It is true that scikit-learn's RandomForest and some implementations of XGBoost only accept numerical inputs, so you are forced to encode a Categorical variable. In this case, one-hot is better than label encoding for all the reasons that you mentioned.

I am not advocating using label encoding for a Categorical variable. Many Tree-based algorithms, like LGBM, do NOT require one-hot encoding. This is because one-hot encoding forces the tree to select only ONE of the values of a categorical variable at a time. The variable will be selected only if that single value, BY ITSELF, can reduce the Entropy more than any other variable.

That is why I would rather use LGBM, since this package allows you to specify a variable as Categorical and will treat it properly by trying combinations of the Categorical variable's levels at any new node.

Because one-hot encoding limits how VARA can be split (only allowing 1 level vs the remaining 4), it makes it LESS likely to enter at any node, versus trying all combinations of VARA levels. Assume that VARA can take the following 5 values [a, b, c, d, e]. Under one-hot, at any new node you can only split VARA on one of the values vs the remaining 4, e.g., [c] vs not [c], or [e] vs not [e]… BUT a better split could be [c, e] vs [a, b, d], or [b, d] vs [a, c, e]. You cannot get the latter at one node with one-hot encoding.

From notebook.community:

One-hot encoding also presents two problems that are more particular to tree-based models:

1. **The resulting sparsity virtually ensures that continuous variables are assigned higher feature importance.**
2. **A single level of a categorical variable must meet a very high bar in order to be selected for splitting early in the tree building. This can degrade predictive performance.**

My point is, if your model has Categorical inputs with more than 2 levels (not binary), you should be using Tree algorithms like LGBM that allow you to specify the variables as Categorical and will use combinations of the variable levels, as opposed to trying one level of the variable at a time. If you one-hot encode, the higher the number of levels in a Category, the less likely it is that the variable will enter at any new node.
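Here is a hedged sketch of what this looks like with LightGBM's scikit-learn interface, using made-up data; declaring the column as a pandas `category` dtype is enough for LightGBM to apply its native categorical split handling (you can also pass `categorical_feature` explicitly):

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

rng = np.random.default_rng(0)
n = 1000

# Made-up data: the signal depends on the subset {c, e} of VARA's levels,
# a split that one-vs-rest (one-hot) columns cannot express in a single node.
vara = rng.choice(list("abcde"), size=n)
X = pd.DataFrame({
    "VARA": pd.Categorical(vara),   # declare as categorical dtype
    "num": rng.normal(size=n),
})
y = np.isin(vara, ["c", "e"]).astype(int)

# With 'category' dtype columns, LightGBM searches over groupings of the
# levels at each split instead of being limited to one level at a time.
model = lgb.LGBMClassifier(n_estimators=50)
model.fit(X, y)
print(model.score(X, y))   # should fit this toy signal easily
```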

Hello @Oscar_Rosen,

Thank you for bringing more context into the discussion. I was tailoring my answer to the general idea of using one-hot encoding. It is always good to keep it in the toolbox in case we run into a situation where we are out of other options.

Thank you for suggesting a better way to deal with categorical variables. Popular tree-based packages like LGBM and XGBoost have their own ways to deal with categorical variables, as you said. I am providing references (xgboost, LGBM) here for you and other learners who read this post.
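For readers who want to try the XGBoost side, recent versions expose a similar switch (it was introduced as experimental around 1.5 and requires the `hist`-family tree methods in those versions); a minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd
import xgboost as xgb

rng = np.random.default_rng(0)
vara = pd.Categorical(rng.choice(list("abcde"), size=500))
X = pd.DataFrame({"VARA": vara, "num": rng.normal(size=500)})
y = X["VARA"].isin(["c", "e"]).astype(int)

# enable_categorical asks XGBoost to use its native categorical split
# handling for pandas 'category' columns.
model = xgb.XGBClassifier(tree_method="hist", enable_categorical=True,
                          n_estimators=50)
model.fit(X, y)
print(model.score(X, y))
```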

Cheers,
Raymond


Hi Raymond,

I am wondering about using one-hot encoding for categorical variables: does this not introduce a multicollinearity problem when using logistic regression, a neural network, a decision tree, or ML methods in general?

Thanks,
Shilpi

Hello Shilpi @Shilpi_Kumar ,

Welcome to the community!

If we one-hot encode a BINARY variable, then the 2 resulting features are perfectly correlated (each column is exactly 1 minus the other). We don’t want that, and we actually don’t need to one-hot encode a binary variable.

If we one-hot encode a variable with 3 or more classes, then as long as the variable actually takes more than 1 category value, the pairwise correlations among the resulting features won't be perfect. However, they can still be non-zero.
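A quick numerical check of both statements, with made-up data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Binary variable: the two one-hot columns are exact complements,
# so their correlation is exactly -1 (perfect).
binary = pd.Series(rng.choice(["yes", "no"], size=200))
print(pd.get_dummies(binary, dtype=float).corr())

# 3-class variable: the pairwise correlations are non-zero but not ±1.
three = pd.Series(rng.choice(["a", "b", "c"], size=200))
print(pd.get_dummies(three, dtype=float).corr())
```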

Now comes some discussions when two features are somewhat correlated.

Can we ignore some of the (non-perfectly) correlated features?
Not easily.

Even if we speak about numerical features, they are often correlated to some degree. Leaving out one of them needs justification more than just the correlation value. Do we know clearly the causal relationship between the one that is left out and the others that remain?

Also, having correlated features is the usual case, if not the universal one.

Will having collinear features reduce the model’s performance?
Generally speaking, under the right choices of hyperparameters, it shouldn’t.

Consider two perfectly correlated features x_1 and x_2 in a linear regression where weights w_1 and w_2 are assigned to them. We can easily see that w_1x_1 + w_2x_2 = (w_1 + w_2)x_1 = w_1'x_1, which is reducible to using only one of the two features. However, this gives rise to a problem of interpreting w_1 and w_2: since any combination of them that gives the same w_1' = w_1 + w_2 is equally good, such as (w_1=1, w_2=2) and (w_1=-10, w_2=13), we can’t talk precisely about the individual contributions of x_1 and x_2. A similar effect happens with partially correlated features.
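A tiny sketch of that reducibility with synthetic data: when x_2 is an exact copy of x_1, any two weight pairs with the same sum produce identical predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1.copy()                      # perfectly correlated with x1
X = np.column_stack([x1, x2])

# Two different weight assignments with the same sum w1 + w2 = 3.
for w1, w2 in [(1, 2), (-10, 13)]:
    preds = X @ np.array([w1, w2])
    print(w1, w2, np.allclose(preds, 3 * x1))   # True in both cases
```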

However, if we have 1000 features in total but 999 of them are perfectly correlated copies, this may be harmful to a gradient-boosted decision tree that uses feature (column) subsampling. Because features are then selected partly by chance, those 999 correlated features can overwhelm all the trees, leaving the single, different last feature to play little or no role in the final model.

Therefore, if we know two features are perfectly correlated, then we will keep only one. Otherwise, which is the case for one-hot encoded features, we will use them with care.
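Here is a rough toy illustration of that dilution effect, using a random forest with per-split feature sampling as a stand-in for a boosted ensemble with column subsampling (the data and the 5-copies setup are made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000

# Two equally informative underlying signals; the first is duplicated 5 times.
signal = rng.normal(size=n)
other = rng.normal(size=n)
y = (signal + other + rng.normal(scale=0.5, size=n) > 0).astype(int)
X = np.column_stack([signal] * 5 + [other])   # columns 0-4 are identical

# max_features=1 forces a random feature choice at each split, so the 5
# identical copies soak up most of the split opportunities by sheer count.
model = RandomForestClassifier(n_estimators=300, max_features=1, random_state=0)
model.fit(X, y)
imp = model.feature_importances_
print("5 duplicated copies (total importance):", imp[:5].sum().round(3))
print("single distinct feature               :", imp[5].round(3))
```

Both underlying signals are equally informative, yet the distinct feature ends up with only a fraction of the credit it deserves.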

Cheers,
Raymond

Hi Raymond,

Thanks for your prompt response and for sharing the explanations. This makes sense. I would expect that, in practice, there will always be some correlation among features. Even classical statistical theory says that it’s okay. The true problem in statistics only occurs when there is perfect multicollinearity, which in turn violates some of the regression assumptions and invalidates the classical inference tests like the t-test, F-test, etc.

I gather from the discussion that we need to be alert if we have near-perfect correlations among features, as it may hamper some algorithms' performance.

Best,
Shilpi