General methodology for handling missing data in training examples

Hi All,

I am curious what the most common ways are to handle missing data in training examples for machine learning and deep learning algorithms. For example, suppose we are predicting a house price from 5 features: sqft, bedrooms, bathrooms, floors and year-built. Some training examples may be missing the year-built value, and others may be missing the bedrooms value. When we apply a machine learning algorithm to predict the house price, how should we handle this missing data?

One way I can think of is to prepare the training data by filling in each missing value with some predicted value, like the average of that feature across the other examples. Would this be a good way to handle missing data? Or can we do anything at run time to let the algorithm handle the missing data automatically for us?



Yes! To handle missing data we can drop the rows with empty values, replace the empty values with the mean of that feature, or even generate the missing values using another ML algorithm.
Check out this Kaggle tutorial on how to handle missing values in a dataset: Missing Values | Kaggle
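
To make the first two options concrete, here is a minimal sketch in plain Python. The toy rows, the use of `None` as the missing-value marker, and the helper names `drop_incomplete` and `impute_mean` are all made up for illustration; in practice you would more likely use a library for this. It also assumes every column has at least one observed value.

```python
from statistics import mean

# Hypothetical toy data: each row is [sqft, bedrooms, bathrooms, floors, year_built].
# None marks a missing value (an assumption of this sketch).
rows = [
    [1500, 3, 2, 1, 1995],
    [2000, None, 3, 2, 2005],
    [1200, 2, 1, 1, None],
    [1800, 4, 2, 2, 2010],
]

def drop_incomplete(rows):
    """Option 1: keep only the rows that have no missing values."""
    return [r for r in rows if all(v is not None for v in r)]

def impute_mean(rows):
    """Option 2: replace each missing value with the mean of the
    observed values in the same column. Returns new rows."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        col_means.append(mean(observed))  # fails if a column is entirely missing
    return [
        [col_means[j] if v is None else v for j, v in enumerate(r)]
        for r in rows
    ]

complete_rows = drop_incomplete(rows)  # 2 of the 4 rows survive
imputed_rows = impute_mean(rows)       # all 4 rows kept, gaps filled
```

Dropping rows is simplest but throws away data (here, half of it), while mean imputation keeps every example at the cost of making the filled-in values look more "average" than they really are.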

I don’t know of a general way for algorithms to handle missing data automatically. In Natural Language Processing there are ways for those algorithms to handle words they’ve never “seen” before, but that’s about all I know.

Hope this helps!

The collaborative filtering material discusses a way to mask missing values so that they are excluded from the cost and gradient calculations.

It may be applicable here.

It’s covered later in the course.
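
As a rough sketch of the masking idea: keep a 0/1 mask alongside the data and only let entries with mask 1 contribute to the squared-error cost and its gradient. The function name `masked_squared_error` and the toy numbers are hypothetical. Note that in collaborative filtering the missing entries are the targets (ratings), so carrying this over to missing input features like year-built is an adaptation, not something the course material confirms.

```python
def masked_squared_error(pred, target, mask):
    """Return (cost, grad) for a squared-error cost in which only
    entries with mask == 1 contribute.
    grad[i][j] is d(cost)/d(pred[i][j]); it stays 0.0 for masked-out entries."""
    cost = 0.0
    grad = [[0.0] * len(row) for row in pred]
    for i, row in enumerate(pred):
        for j, p in enumerate(row):
            if mask[i][j]:
                err = p - target[i][j]
                cost += 0.5 * err ** 2
                grad[i][j] = err
    return cost, grad

# Hypothetical 2x2 example. The 0.0 at target[0][1] is only a placeholder
# for a missing value; mask[0][1] == 0 keeps it out of the calculation.
pred = [[4.0, 2.0], [3.0, 5.0]]
target = [[5.0, 0.0], [3.0, 4.0]]
mask = [[1, 0], [1, 1]]

cost, grad = masked_squared_error(pred, target, mask)
```

Because the masked entry contributes nothing to either the cost or the gradient, whatever placeholder is stored there has no effect on training.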