I’m working on a project and would like your ideas on dealing with NaN values before applying a tree ensemble to a classification problem.
Some options are:
Just drop the rows that have NaN values
What do you suggest?
None of them are very good solutions, compared to having a complete data set.
They all work relatively badly, in different ways.
I think the choice depends on how much data you’re missing. Whether it is worth throwing away a row’s other N-1 features just because one is missing depends on how large N is.
There’s no harm in trying several methods and picking the one that gives the least-worst performance.
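One way to run that comparison is with cross-validation. This is only a rough sketch: the synthetic data, the 10% missing rate, the choice of RandomForestClassifier and the three strategies are all assumptions for illustration, not anything from your project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data standing in for the real data set (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the values

scores = {}

# Option 1: drop every row that contains at least one NaN.
complete = ~np.isnan(X).any(axis=1)
scores["drop rows"] = cross_val_score(
    RandomForestClassifier(random_state=0), X[complete], y[complete], cv=5
).mean()

# Options 2 and 3: impute inside a pipeline, so the fill values are
# learned from the training folds only and not from the held-out fold.
for strategy in ("mean", "median"):
    model = make_pipeline(
        SimpleImputer(strategy=strategy),
        RandomForestClassifier(random_state=0),
    )
    scores[f"impute {strategy}"] = cross_val_score(model, X, y, cv=5).mean()

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Bear in mind that the "drop rows" score is measured only on the complete rows, so it is not strictly comparable to the others. If the model has to predict on rows with missing values later, dropping is not an option at prediction time anyway.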
Thanks for the reply.
Please guide me on where to go from here.
Steps I have already applied:
Removed columns where more than 50% of the rows are null.
Encoded the categorical columns using get_dummies.
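For reference, those two steps might look roughly like this in pandas. It is a sketch on a made-up frame: `df`, the column names and the values are all hypothetical. Note that the numeric column still contains a NaN afterwards, since neither step fills anything in.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (assumption).
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "city": ["A", "B", None, "A"],
})

# Keep only the columns where at most 50% of the rows are null.
df = df.loc[:, df.isna().mean() <= 0.5]

# One-hot encode the categorical column; a missing category
# becomes a row of all-zero dummy columns by default.
encoded = pd.get_dummies(df, columns=["city"])

# Count the NaN values that remain in each column.
print(encoded.isna().sum())
```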
When I used this,
model = DecisionTreeClassifier(min_samples_split=min_samples_split,
                               random_state=RANDOM_STATE).fit(X_train, y_train)
There’s a lot of advice in the error message. I don’t have much to add.
You could try replacing all of the NaN values with the mean value for that feature.
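A minimal sketch of that, using scikit-learn's SimpleImputer in a pipeline with the same classifier. The tiny arrays and the values given to `RANDOM_STATE` and `min_samples_split` are assumptions for illustration only; substitute your own.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

RANDOM_STATE = 55       # assumed value
min_samples_split = 2   # assumed value

# Tiny made-up training set with two missing entries.
X_train = np.array([[1.0, 10.0],
                    [np.nan, 12.0],
                    [3.0, np.nan],
                    [5.0, 20.0]])
y_train = np.array([0, 0, 1, 1])

# The imputer learns each column's mean from the training data and
# reuses those means for any data passed through the pipeline later.
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    DecisionTreeClassifier(min_samples_split=min_samples_split,
                           random_state=RANDOM_STATE),
).fit(X_train, y_train)

# Inspect what the tree actually saw after imputation.
filled = model.named_steps["simpleimputer"].transform(X_train)
print(filled)

# New rows with NaN are filled with the training means before prediction.
predictions = model.predict(np.array([[np.nan, 11.0]]))
```

Keeping the imputer inside the pipeline means the means are computed from the training set only, which avoids leaking information from the test set.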