In the statistics of the programming assignment, some features are marked red because over 90% of their values are missing. Shouldn’t those features be removed as well, as they don’t give much information anyway?
Your suggestion of eliminating columns containing missing values above a certain threshold is a valid approach for data cleaning. A drawback to this approach is that it views the training dataset in isolation.
When we have access to training / evaluation / serving samples, it makes sense to compare statistics across the three splits to understand whether the training dataset was constructed poorly. This helps avoid building models that don’t reflect the data distribution of the evaluation / serving splits.
To give a concrete example, before removing a column we should be able to tell whether it is missing values in the training set but not in the other splits. Once we know this, we can decide on one of the following:
- Treat it as a problem with the dataset (for example, the training split was constructed poorly)
- Impute missing values
- Remove the column (your original suggestion)
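The comparison described above can be sketched with plain pandas. This is only an illustration, not the assignment's tooling: the function names, the toy columns (`age`, `income`, `zip`), and the 0.9 threshold are all assumptions made for the example.

```python
import pandas as pd

NAN = float("nan")


def missing_fraction_by_split(splits):
    """Build a table of per-column missing fractions, one column per split."""
    return pd.DataFrame({name: df.isna().mean() for name, df in splits.items()})


def flag_inconsistent_columns(splits, threshold=0.9, train_name="train"):
    """Return columns mostly missing in training but not in some other split.

    The threshold is a hypothetical cutoff chosen for illustration.
    """
    table = missing_fraction_by_split(splits)
    others = table.drop(columns=train_name)
    mostly_missing_in_train = table[train_name] > threshold
    # A column absent from a split shows up as NaN here and is not counted.
    populated_elsewhere = (others <= threshold).any(axis=1)
    return table[mostly_missing_in_train & populated_elsewhere]


# Toy data (made up): "age" is empty only in training, "zip" is empty everywhere.
splits = {
    "train": pd.DataFrame({
        "age": [NAN, NAN, NAN, NAN],
        "income": [1.0, 2.0, NAN, 4.0],
        "zip": [NAN, NAN, NAN, NAN],
    }),
    "eval": pd.DataFrame({
        "age": [30.0, 41.0, 25.0, 52.0],
        "income": [3.0, NAN, 5.0, 6.0],
        "zip": [NAN, NAN, NAN, NAN],
    }),
    "serving": pd.DataFrame({
        "age": [28.0, 39.0, 47.0, 33.0],
        "income": [2.0, 4.0, 6.0, 8.0],
        "zip": [NAN, NAN, NAN, NAN],
    }),
}

table = missing_fraction_by_split(splits)
flagged = flag_inconsistent_columns(splits)
print(table)
print(flagged)
```

In this toy case, a column like `age` that is empty only in training points toward a dataset problem or imputation, while a column like `zip` that is empty in every split is a more plausible candidate for removal.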