Hello everyone,
I’m working on a project and would like your ideas on handling NaN values before applying a tree ensemble to a classification problem.
Some options are:
Just drop rows that have NaN values
Univariate Imputation
Multivariate Imputation
What do you suggest?
None of them are very good solutions, compared to having a complete data set.
They all work relatively badly, in different ways.
I think the choice depends on how much data you’re missing. Dropping a row because 1 feature is missing also throws away the other (N-1) features you did observe, so whether that is acceptable depends on the magnitude of N.
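To put rough numbers on that, here is a small sketch. It assumes each feature is missing independently with the same probability, which is my simplification and probably not true of your data. The function name and the 2% rate are made up for illustration.

```python
def complete_row_fraction(n_features: int, p_missing: float) -> float:
    """Expected fraction of rows with no missing value.

    Assumption (illustrative only): every feature is missing independently
    with the same probability p_missing.
    """
    return (1.0 - p_missing) ** n_features


# With 2% of cells missing, see how many rows survive "just drop rows".
for n in (5, 20, 100):
    kept = complete_row_fraction(n, 0.02)
    print(f"N={n:3d}: about {kept:.0%} of rows have no NaN")
# N=  5: about 90% of rows have no NaN
# N= 20: about 67% of rows have no NaN
# N=100: about 13% of rows have no NaN
```

So under that assumption, dropping rows costs little with a handful of features and most of the data set with a hundred.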
There’s no harm in trying several methods and picking the one that gives the least-worst performance.
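A rough sketch of what trying several methods could look like, assuming scikit-learn and a random forest as the tree ensemble. The synthetic data, the 10% missing rate, and the helper names are all placeholders of mine, so swap in your own X and y.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder data (an assumption): 300 rows, 8 features, ~10% of cells set to NaN.
rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.10] = np.nan


def make_forest():
    """Hypothetical helper: a fresh random forest for each experiment."""
    return RandomForestClassifier(n_estimators=50, random_state=0)


def cv_accuracy(model, features, labels):
    """Mean 5-fold cross-validated accuracy."""
    return cross_val_score(model, features, labels, cv=5, scoring="accuracy").mean()


scores = {}

# Option 1: drop every row that contains at least one NaN.
# Note this score is computed on the surviving rows only.
complete = ~np.isnan(X_missing).any(axis=1)
scores["drop rows"] = cv_accuracy(make_forest(), X_missing[complete], y[complete])

# Option 2: univariate imputation (each column filled with its own median).
# Putting the imputer in a pipeline refits it inside each CV fold.
scores["univariate (median)"] = cv_accuracy(
    make_pipeline(SimpleImputer(strategy="median"), make_forest()), X_missing, y
)

# Option 3: multivariate imputation (each column modelled from the others).
scores["multivariate (iterative)"] = cv_accuracy(
    make_pipeline(IterativeImputer(max_iter=10, random_state=0), make_forest()),
    X_missing,
    y,
)

for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:26s} {score:.3f}")
```

One caveat on comparing these: the drop-rows score only covers rows that happened to be complete, so it is not measured on the same examples as the other two, and it says nothing about what to do when a new row arrives with a NaN.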