How do you handle missing values in data? Explain in detail assuming different scenarios

Please help me solve this.

Hey @Sandilya_Bonu, thanks for your post.

Handling missing values is an essential step in data preprocessing and analysis.
I will try to cover as many techniques as I can to help you.

1. Categorical Data
Scenario: you are working with categorical data.

There are several approaches to handle it:

  • Delete rows with missing values: If the missing values are only a small fraction of the data and are missing completely at random, you might consider removing the rows that contain them.
    Be aware that this technique can lead to loss of valuable information.

  • Imputation (filling) with the mode: Replace missing categorical values with the most frequent category (the mode) of the respective column.
    This is a simple method, but it might not be suitable if the mode is not representative or if the missingness follows a specific pattern.

  • Create a separate category: Add a category such as “Unknown” or “Missing” to represent missing values, so the fact that a value was missing is kept as information.
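
The three categorical options above can be sketched with pandas roughly as follows. This is only a minimal illustration: the toy data and the column name `city` are made up for the example, so adapt them to your own dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the "city" column is invented for illustration.
df = pd.DataFrame({"city": ["Cairo", "Giza", np.nan, "Cairo", np.nan, "Cairo"]})

# Option 1: drop the rows where the category is missing.
dropped = df.dropna(subset=["city"])

# Option 2: fill missing values with the most frequent category (mode).
# mode() ignores NaN by default and returns a Series, so take the first entry.
mode_value = df["city"].mode()[0]
mode_filled = df["city"].fillna(mode_value)

# Option 3: keep missingness as its own explicit category.
unknown_filled = df["city"].fillna("Unknown")

print(len(dropped), mode_value)
print(unknown_filled.value_counts())
```

Which option fits best depends on how much data is missing and why: dropping is the simplest, while the separate category is the only one that preserves the missingness itself.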

2. Numeric Data
Scenario: you are analyzing a dataset that includes numeric columns such as age or income level.

Some approaches to consider when working with numeric data:

  • Delete rows with missing values: As with categorical data, you might consider this approach if the fraction of missing values is small.
    Be careful about the potential bias introduced by removing data.

  • Mean/median imputation: Replace missing values with the mean (average) or median (middle value) of the non-missing values in the respective column.
    This is a simple approach, but it shrinks the variance of the column and can bias results if the missingness is not random. The median is usually the safer choice when the column is skewed or has outliers.

  • Regression imputation: This is a more advanced technique. You predict the missing values with a regression model trained on the other features.
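
Here is a rough sketch of the numeric options above, using pandas and scikit-learn. The `age` and `income` columns and their values are invented for the example, and the regression step assumes `age` is fully observed and is a reasonable predictor of `income`, which you would need to check on real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: "income" has gaps, "age" is complete.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [30000.0, 42000.0, np.nan, 68000.0, np.nan, 36000.0],
})

# Mean and median imputation: fill gaps with a single summary value
# computed from the non-missing entries.
mean_filled = df["income"].fillna(df["income"].mean())
median_filled = df["income"].fillna(df["income"].median())

# Regression imputation: fit on rows where income is known,
# then predict income for the rows where it is missing.
known = df[df["income"].notna()]
missing = df[df["income"].isna()]
model = LinearRegression().fit(known[["age"]], known["income"])

regression_filled = df["income"].copy()
regression_filled.loc[missing.index] = model.predict(missing[["age"]])

print(pd.DataFrame({
    "mean": mean_filled,
    "median": median_filled,
    "regression": regression_filled,
}))
```

Unlike mean or median filling, the regression version gives each missing row its own value based on that row's other features.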

I only covered the most common techniques and data types (categorical and numeric) because your question was a bit general and you didn’t mention a specific scenario.

I hope this helps.
Best Regards,
Jamal


Thanks for the quick reply.

Given 100 features, you have to select 10 features for a 3 class classification problem. What is your approach?

If you don’t mind, please look into this as well.

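PCA is a useful technique for this: it can reduce the 100 features to 10 components. Keep in mind that these components are combinations of the original features, not a subset of 10 of them.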
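
A minimal sketch of the PCA suggestion, assuming scikit-learn and a synthetic 3-class dataset generated with `make_classification` (the sample count and `random_state` are arbitrary choices for the example). The features are standardized first because PCA is sensitive to feature scale. The 10 outputs are new components built from all 100 inputs, not 10 of the original columns.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 100 features, 3 classes.
X, y = make_classification(
    n_samples=300,
    n_features=100,
    n_informative=10,
    n_classes=3,
    random_state=0,
)

# Standardize so that no feature dominates just because of its scale.
X_scaled = StandardScaler().fit_transform(X)

# Project the 100 standardized features onto 10 principal components.
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)
# Fraction of the total variance kept by the 10 components.
print(pca.explained_variance_ratio_.sum())
```

Note that PCA does not look at the class labels `y`, so it is worth checking classifier performance on `X_reduced` before relying on it.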