Null values for categorical data set

I am trying to implement a multioutput regression problem on a measles dataset of schools, where the targets are: overall vaccination rate, location, and type (public, private or charter) of the school.

However, 58% of the ‘type’ values and 38% of the ‘county’ values are missing, and both contribute to the ‘location’ target. In other words, the labels we are trying to predict are largely missing from the dataset. How should I handle the imputation?

I read that decision trees can handle missing values in the training data. But would such a large proportion of missing values be handled properly?

What other ways are there to deal with a large share of missing categorical values?

Hey @Arisha_Prasain,
I am a little confused by what you have described. Could you state a few things more clearly?

For instance, I am assuming that in your problem you have to predict the 3 things stated above.

You also said that 58% of the type feature is missing, so how does the dataset have type both as a target and as a feature?

The imputation techniques that are applied with decision trees can be applied with other models as well. So if the missing values can be handled properly for a decision tree, chances are they can be handled for other models too. Of course, the inner workings of each model will affect how it responds to the imputed values.
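To illustrate the point about reusing imputation with other models, here is a minimal sketch, assuming scikit-learn and a made-up toy dataset (the `county` values and the binary target `y` are hypothetical, not taken from the measles data). It fills missing categories with an explicit "missing" label, one-hot encodes them, and feeds the result to a non-tree model.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical feature with missing entries.
X = pd.DataFrame(
    {"county": ["Adams", np.nan, "Baker", "Adams", np.nan, "Baker"]},
    dtype=object,
)
# Hypothetical binary target, only here so the pipeline can be fitted.
y = np.array([1, 0, 0, 1, 0, 0])

model = make_pipeline(
    # Replace NaN with an explicit "missing" category.
    SimpleImputer(strategy="constant", fill_value="missing"),
    # Turn each category, including "missing", into its own 0/1 column.
    OneHotEncoder(handle_unknown="ignore"),
    # Any estimator could go here; logistic regression is just one example.
    LogisticRegression(),
)
model.fit(X, y)
preds = model.predict(X)

# Categories seen by the encoder after imputation.
categories = list(model.named_steps["onehotencoder"].categories_[0])
```

Swapping `LogisticRegression` for a tree-based estimator leaves the imputation and encoding steps unchanged, which is the sense in which the technique carries over between models.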

You can read about some of the ways used to perform imputation here.


Hey @Arisha_Prasain,

Before giving my opinion, I’ll make the following assumptions:

  1. Overall vaccination rate, location and type (public, private or charter) are the X values that are being used to find some Y
  2. The location feature is not limited to a single country

I think it’s important to try to find out why the dataset has those missing values. Take the missing ‘county’ values: one reason they could be missing is that not all countries have a county system (China has provinces, Japan has prefectures, etc.).
Similarly, there could be specific reasons behind the missing values in the other features.

One solution I have seen used in cases where a missing value means ‘something’ is to treat ‘having a missing value’ as a feature in itself. E.g. I could create a feature called ‘Has_Type’ or ‘Has_County’ with boolean values, while at the same time imputing the missing values in ‘Type’ and ‘County’ with the mean/median/mode/0 (depending on the data).
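A rough sketch of that idea, assuming pandas and a made-up toy table (the rows are hypothetical; only the column names ‘type’ and ‘county’ come from the question). Since both columns are categorical, the sketch uses the mode, which is the only one of mean/median/mode that is defined for categories.

```python
import pandas as pd

# Hypothetical toy data standing in for the real school records.
df = pd.DataFrame({
    "type": ["Public", None, "Private", None, "Public"],
    "county": ["Adams", "Adams", None, "Baker", None],
})

for col in ["type", "county"]:
    # Record missingness first, so that information survives the imputation.
    df[f"has_{col}"] = df[col].notna()
    # Fill the gaps with the most frequent observed category (the mode).
    most_frequent = df[col].mode(dropna=True).iloc[0]
    df[col] = df[col].fillna(most_frequent)
```

After this, `has_type` and `has_county` tell the model which rows were imputed, while ‘type’ and ‘county’ no longer contain nulls.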

Hope this helps.


Hey @the_sophic,
Welcome to the community.

As per @Arisha_Prasain’s description, “overall vaccination rate, location and type” are targets and not features, although he also stated that type is a feature. So either these two are different things, or there is a typo.

