One-hot encoding using pandas: getting different numbers of columns for train and test sets

I am doing one-hot encoding using pandas, but I am getting different numbers of columns for the train and test sets (note: the number of columns in the train and test sets is equal before encoding). Please help!

Hey @rishabh_singh7,
Before performing one-hot encoding on your datasets, did you compare the sets of distinct values for every column between your train and test datasets? My best bet would be that for at least one column, the train and test datasets have different sets of distinct values. For instance, for some feature X, the train dataset might have the distinct values {0, 1, 2} while the test dataset has the distinct values {0, 1, 2, 3}. In this case, one-hot encoding will create 3 columns for feature X in the train dataset, but 4 columns in the test dataset. This would only happen, though, if you used separate one-hot encoders for your two datasets. So, can you please confirm the answers to these questions first?
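
Here is a rough sketch of what I mean; the column name feature_X and the values are only an illustration, not your actual data:

```python
import pandas as pd

# Invented example: the same column has one extra category in the test set
train = pd.DataFrame({"feature_X": [0, 1, 2, 1, 0]})
test = pd.DataFrame({"feature_X": [0, 1, 2, 3, 2]})

# Compare the sets of distinct values per column before encoding
for col in train.columns:
    train_vals = set(train[col].unique())
    test_vals = set(test[col].unique())
    if train_vals != test_vals:
        print(col, "differs:", train_vals, "vs", test_vals)

# Encoding each frame separately yields a different number of columns
print(pd.get_dummies(train, columns=["feature_X"]).shape[1])  # 3 columns
print(pd.get_dummies(test, columns=["feature_X"]).shape[1])   # 4 columns
```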

Cheers,
Elemento

It's two different CSV files (and a big dataset), so are you suggesting I should merge both datasets first, do the encoding once, and then split it back into train and test datasets?

Hey @rishabh_singh7,
That’s one way to move forward. But as you mentioned, both files are large. In that case, you can simply check the sets of distinct values in the columns you intend to one-hot encode, across both sets, as I mentioned above. Another way to move forward is to use scikit-learn’s OneHotEncoder, which you can find here. In this case, if the sets of distinct values differ across the sets, it will throw an error by itself, depending on how you set the handle_unknown argument; see the sketch below. Let us know if this helps.
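
A minimal sketch of that approach, with made-up data: fit the encoder on the train set only and reuse it on the test set, so both end up with the same columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Invented data: the test set contains a category (3) unseen during training
train = pd.DataFrame({"feature_X": [0, 1, 2, 1, 0]})
test = pd.DataFrame({"feature_X": [0, 1, 3, 2, 2]})

# Fit the encoder on the training data only, then transform both sets with it
enc = OneHotEncoder(handle_unknown="ignore")
X_train = enc.fit_transform(train[["feature_X"]])
X_test = enc.transform(test[["feature_X"]])

# Both now have the same number of columns; the unseen value 3 becomes all zeros
print(X_train.shape, X_test.shape)  # (5, 3) (5, 3)

# With handle_unknown="error" (the default), enc.transform(test[["feature_X"]])
# would instead raise a ValueError because of the unseen category.
```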

Cheers,
Elemento

Quick question: how many rows are there in these files?

If you are getting different one-hot columns, then the domain of the feature differs between the two sets, meaning one set has one or more values that do not appear in the other.
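
If you want to keep using pd.get_dummies on the two files separately, one common workaround is to reindex the encoded test frame onto the encoded train frame's columns. A rough sketch with invented data (note that categories unseen in training are silently dropped, much like handle_unknown="ignore"):

```python
import pandas as pd

# Invented data: the test set has an extra category "d" in the same column
train_df = pd.DataFrame({"color": ["a", "b", "c", "a"]})
test_df = pd.DataFrame({"color": ["a", "b", "d", "c"]})

train_enc = pd.get_dummies(train_df)
test_enc = pd.get_dummies(test_df)

# Align the test columns to the train columns: unseen categories are dropped,
# and categories missing from the test file are added as all-zero columns.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))  # ['color_a', 'color_b', 'color_c']
print(list(test_enc.columns))   # same columns, so the shapes now match
```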