Question about TFDV statistics in programming assignment

In the statistics of the programming assignment, some features are marked red because over 90% of their values are missing. Shouldn’t those features be removed as well, since they don’t give much information anyway?

Your suggestion of dropping columns whose share of missing values exceeds a certain threshold is a valid approach to data cleaning. Its drawback is that it looks at the training dataset in isolation.

When we have access to training / evaluation / serving samples, it makes sense to compare statistics across the three splits to check whether the training dataset was constructed poorly. This helps avoid building models that fit the training data but not the data distribution seen in the evaluation / serving splits.
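As a rough sketch of that comparison, here it is in plain Python rather than TFDV. The column names and rows are made up for illustration. The code computes the fraction of missing values per column for each split so the splits can be read side by side:

```python
def missing_fraction(rows, column):
    """Fraction of rows where `column` is absent or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


# Made-up example splits; "age" and "weight" are hypothetical feature names.
splits = {
    "train": [
        {"age": 30, "weight": None},
        {"age": 41, "weight": None},
        {"age": 25, "weight": None},
        {"age": None, "weight": 70.5},
    ],
    "eval": [{"age": 33, "weight": 64.0}, {"age": 52, "weight": 81.2}],
    "serving": [{"age": 29, "weight": 58.3}, {"age": 47, "weight": None}],
}

columns = ["age", "weight"]

# report[column][split] -> fraction of missing values in that split
report = {
    col: {name: missing_fraction(rows, col) for name, rows in splits.items()}
    for col in columns
}

for col, per_split in report.items():
    print(col, per_split)
```

In this toy data, "weight" is mostly missing in training but fully present in evaluation. That kind of mismatch is worth investigating before dropping the column.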

As a concrete example, before removing a column we should check whether it is missing values in the training set but not in the other splits. Once we know that, we can choose one of the following:

  1. Treat it as a problem with the dataset and fix how the training split was built
  2. Impute the missing values
  3. Remove the column (your original suggestion)
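
One way to picture the choice between these options is a small helper that takes the missing-value fractions of a column in each split and suggests a next step. This is only a sketch. The function name `suggest_action` and the thresholds `high` and `gap` are assumptions made for illustration; they do not come from the assignment or from TFDV.

```python
def suggest_action(train_missing, other_missing, high=0.9, gap=0.5):
    """Suggest a next step from missing-value fractions (0.0 to 1.0).

    train_missing: fraction missing in the training split.
    other_missing: non-empty list of fractions missing in the
        evaluation / serving splits.
    high, gap: illustrative thresholds, chosen arbitrarily here.
    """
    lowest_other = min(other_missing)
    if train_missing - lowest_other >= gap:
        # Far more missing in training than in another split:
        # the training split was probably constructed poorly.
        return "investigate dataset"
    if train_missing >= high:
        # Mostly missing in every split: little information to keep.
        return "remove column"
    if train_missing > 0:
        # Some values missing, consistently across splits.
        return "impute missing values"
    return "keep as is"


print(suggest_action(0.95, [0.02, 0.05]))
print(suggest_action(0.95, [0.93, 0.96]))
print(suggest_action(0.10, [0.08, 0.12]))
```

In practice the thresholds depend on the feature and the model, so the output is better read as a prompt for a closer look than as a rule.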