Encoding sex and poverty as numerical values?

In the labs/assignment, sex and poverty are encoded as 1 or 2 depending on the split.

Would it not be better to one-hot encode these as features such as "is_male" and "is_female" to aid interpretation?

I'm not sure how this would affect, e.g., the SHAP explanations at the end?

Hi @Amadeus_Stevenson

In general, for the decision tree model used in the assignment, one-hot encoding is not needed, and so-called label encoding is sufficient.

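This is because a decision tree branches on a threshold condition for a single column, so it can separate the categories by their integer values within that one column. For a feature with only two values, such as 1 and 2, a single split (for example, value <= 1.5) separates them completely.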
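As a minimal sketch of this point, the snippet below compares a threshold split on a label-encoded column with a split on a one-hot column. The toy data, the `split` helper, and the assumption that 1 means male and 2 means female are all illustrative and not taken from the assignment.

```python
# Hypothetical toy data: sex label-encoded as 1 or 2.
# Assumption for illustration only: 1 = male, 2 = female.
sex = [1, 2, 2, 1, 2, 1]

# One-hot version of the same column (column name is illustrative).
is_female = [1 if s == 2 else 0 for s in sex]


def split(values, threshold):
    """Return row indices sent left (value <= threshold) and right (value > threshold)."""
    left = [i for i, v in enumerate(values) if v <= threshold]
    right = [i for i, v in enumerate(values) if v > threshold]
    return left, right


# A tree can split the label-encoded column at 1.5 ...
label_left, label_right = split(sex, 1.5)
# ... or the one-hot column at 0.5.
onehot_left, onehot_right = split(is_female, 0.5)

# Both splits send exactly the same rows to each branch.
print(label_left == onehot_left and label_right == onehot_right)
```

For a two-valued feature the two encodings give the tree the same information, so the extra column adds nothing. With three or more unordered categories the picture can differ, because label encoding imposes an order on the values.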

Regarding the SHAP results, the branching rules learned by the decision tree may differ between one-hot encoding and label encoding. Therefore, the SHAP values may not be exactly the same.

In addition, one-hot encoding adds new columns, and each column gets its own SHAP value, so it may be a little difficult to compare the results directly with those from label encoding.
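One common way to make the two cases easier to compare is to sum the attributions of the one-hot columns that came from the same original feature. The sketch below shows only that bookkeeping step. The attribution numbers are made up for illustration and are not SHAP output, and the `groups` mapping and `aggregate` helper are hypothetical names.

```python
# Made-up per-row attributions for a model trained on one-hot columns.
# These numbers are illustrative only; they were not produced by SHAP.
onehot_attributions = [
    {"is_male": 0.10, "is_female": 0.05, "poverty": -0.20},
    {"is_male": -0.08, "is_female": -0.04, "poverty": 0.30},
]

# Map each one-hot column back to the original feature it came from.
groups = {"is_male": "sex", "is_female": "sex"}


def aggregate(row, groups):
    """Sum the attributions of columns that belong to the same original feature."""
    combined = {}
    for column, value in row.items():
        feature = groups.get(column, column)
        combined[feature] = combined.get(feature, 0.0) + value
    return combined


aggregated = [aggregate(row, groups) for row in onehot_attributions]
for row in aggregated:
    print(row)
```

The summed value gives one number per original feature, which can be placed next to the single-column value from the label-encoded model. The two models may still have learned different trees, so the numbers are comparable in layout but not guaranteed to match.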
