One-hot encoding using pandas: getting different numbers of columns for train and test sets

I am doing one-hot encoding using pandas, but I am getting different numbers of columns for the train and test sets (note: the number of columns in the train and test sets is equal before encoding). Please help!

Hey @rishabh_singh7,
Before performing one-hot encoding on your datasets, did you compare the sets of distinct values for every column between your train and test datasets? My best bet would be that for at least one column, the train and test datasets have different sets of distinct values. For instance, for some feature X, the train dataset might have the distinct values {0, 1, 2} while the test dataset has the distinct values {0, 1, 2, 3}. In this case, one-hot encoding will create 3 columns for feature X in the train dataset, but 4 columns in the test dataset. This would only happen, though, if you used separate one-hot encoders for your two datasets. So, can you please confirm the answers to these questions first?
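
Here is a rough sketch of what I mean; the column name feature_X and the values are only an illustration, not your actual data:

```python
import pandas as pd

# Invented example: the same column has one extra category in the test set
train = pd.DataFrame({"feature_X": [0, 1, 2, 1, 0]})
test = pd.DataFrame({"feature_X": [0, 1, 2, 3, 2]})

# Compare the sets of distinct values per column before encoding
for col in train.columns:
    train_vals = set(train[col].unique())
    test_vals = set(test[col].unique())
    if train_vals != test_vals:
        print(col, "differs:", train_vals, "vs", test_vals)

# Encoding each frame separately yields a different number of columns
print(pd.get_dummies(train, columns=["feature_X"]).shape[1])  # 3 columns
print(pd.get_dummies(test, columns=["feature_X"]).shape[1])   # 4 columns
```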

Cheers,
Elemento

It's two different CSV files (and a big dataset), so are you suggesting I should merge both datasets first, do the encoding once, and then split it back into train and test datasets?

Hey @rishabh_singh7,
That’s one way to move forward. But as you mentioned, both files are large. In that case, you can simply check the sets of distinct values in the columns you intend to one-hot encode, across both sets, as I mentioned above. Another way to move forward is to use scikit-learn’s OneHotEncoder, which you can find here. In this case, if the sets of distinct values differ across the sets, it will throw an error by itself, depending on how you set the handle_unknown argument; see the sketch below. Let us know if this helps.
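
A minimal sketch of that approach, with made-up data: fit the encoder on the train set only and reuse it on the test set, so both end up with the same columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Invented data: the test set contains a category (3) unseen during training
train = pd.DataFrame({"feature_X": [0, 1, 2, 1, 0]})
test = pd.DataFrame({"feature_X": [0, 1, 3, 2, 2]})

# Fit the encoder on the training data only, then transform both sets with it
enc = OneHotEncoder(handle_unknown="ignore")
X_train = enc.fit_transform(train[["feature_X"]])
X_test = enc.transform(test[["feature_X"]])

# Both now have the same number of columns; the unseen value 3 becomes all zeros
print(X_train.shape, X_test.shape)  # (5, 3) (5, 3)

# With handle_unknown="error" (the default), enc.transform(test[["feature_X"]])
# would instead raise a ValueError because of the unseen category.
```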

Cheers,
Elemento

Quick question: how many rows are there in these files?

If you are getting different one-hot columns, then the domain of the feature differs between the two sets, meaning one set has one or more values that do not appear in the other.
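
If you want to keep using pd.get_dummies on the two files separately, one common workaround is to reindex the encoded test frame onto the encoded train frame's columns. A rough sketch with invented data (note that categories unseen in training are silently dropped, much like handle_unknown="ignore"):

```python
import pandas as pd

# Invented data: the test set has an extra category "d" in the same column
train_df = pd.DataFrame({"color": ["a", "b", "c", "a"]})
test_df = pd.DataFrame({"color": ["a", "b", "d", "c"]})

train_enc = pd.get_dummies(train_df)
test_enc = pd.get_dummies(test_df)

# Align the test columns to the train columns: unseen categories are dropped,
# and categories missing from the test file are added as all-zero columns.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))  # ['color_a', 'color_b', 'color_c']
print(list(test_enc.columns))   # same columns, so the shapes now match
```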