Tossing out bad examples: Real world production data distribution

Isn’t there a risk that you bias the training data away from the real-world distribution? What about cases where the bad, irregular data could reasonably be expected after deployment and the model needs to make a decision on it?

5 Likes

Aside from the real-world distribution concern, this raises the issue of characterizing unambiguously what “noisy data” means, so that such examples can be excluded in production.

2 Likes

Why not create a separate class for very noisy and very ambiguous data, thereby also teaching the model to distinguish noisy and irrelevant examples from the expected data?
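
For illustration, here is a minimal sketch of that idea, assuming a generic scikit-learn classifier and synthetic stand-in data (all names and numbers below are hypothetical): the noisy/irrelevant examples get their own label, so the model can predict it explicitly instead of being forced to assign them to a real class.

```python
# Hypothetical sketch: give noisy/irrelevant examples their own class label.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_clean = rng.normal(size=(100, 5))            # stand-in for regular examples
y_clean = rng.integers(0, 3, size=100)         # real classes 0, 1, 2
X_noisy = rng.normal(scale=5.0, size=(20, 5))  # stand-in for very noisy/ambiguous examples
y_noisy = np.full(20, 3)                       # class 3 = "noisy / irrelevant"

X = np.vstack([X_clean, X_noisy])
y = np.concatenate([y_clean, y_noisy])

clf = LogisticRegression(max_iter=1000).fit(X, y)

# At inference time, predictions of class 3 can be rejected or routed to a human.
print(clf.predict(rng.normal(scale=5.0, size=(1, 5))))
```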

8 Likes

This is a very good question. In my experience, when some data is too difficult for a human to label, the ML model will also have limited performance on those cases. In my view, the right approach here is to detect these “bad examples” with some algorithm and just output “Not sure”.
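
One common way to implement that “Not sure” behaviour is confidence thresholding. A rough sketch, assuming a trained classifier with `predict_proba` and a hand-tuned cutoff (both hypothetical):

```python
# Sketch: abstain ("Not sure") whenever the classifier's top probability is low.
import numpy as np

THRESHOLD = 0.7  # hypothetical confidence cutoff, tune on a validation set

def predict_or_abstain(clf, X):
    proba = clf.predict_proba(X)                 # shape (n_samples, n_classes)
    top_conf = proba.max(axis=1)                 # confidence of the top class
    preds = clf.classes_[proba.argmax(axis=1)]   # predicted labels
    return [p if conf >= THRESHOLD else "Not sure"
            for p, conf in zip(preds, top_conf)]
```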

3 Likes

Maybe also have a module that analyzes the data in real time and looks for distribution shifts and outliers. It should alert people to go look at the data.
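
As a rough sketch of such a module (my own illustration, not something from the event), one could compare a live window of a numeric feature against a training-time reference using a two-sample Kolmogorov–Smirnov test from scipy; the significance level and the alerting hook are placeholders.

```python
# Sketch: flag a likely distribution shift in one numeric feature.
import numpy as np
from scipy.stats import ks_2samp

def drift_alert(reference: np.ndarray, live_window: np.ndarray, alpha: float = 0.01) -> bool:
    """Return True if the live window looks significantly different from the reference."""
    _stat, p_value = ks_2samp(reference, live_window)
    return p_value < alpha  # True -> alert a human to look at the data

# Example with synthetic data standing in for training data vs. production traffic.
rng = np.random.default_rng(0)
print(drift_alert(rng.normal(0, 1, 5000), rng.normal(0.5, 1, 500)))  # likely True
```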

5 Likes

Clean, denoised data is super important. I have another question, Dillon / panel / et al.: any thoughts on using AutoML to match the best model to the data?

2 Likes

This question was answered by Dr. Andrew Ng, Dillon Laird, and Alex Ratner.
(Time Stamp 1:21:35) Event recording

2 Likes

I think a really solid example that the poster may be asking about is Google Health’s study in Thailand: Google medical researchers humbled when AI screening tool falls short in real-life testing – TechCrunch

3 Likes

Thanks for the replies! I’ll attempt a summary of my understanding.

As @nasreenpmohsin and @TodoranGeorge suggested, and as the panelists (particularly Dillon) echoed, it could be useful to engineer your system to detect these issues in production and kick off some (possibly non-ML, or even non-machine) intervention.
As Andrew mentioned, DL methods tend to be low bias, high variance, so a mismatch in the distribution of X between the training set and the test set is less of an issue as long as P(Y | X) remains consistent.
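
In other words (my own framing, not a quote from the panel), this is the usual distinction between covariate shift and concept drift:

```latex
% Covariate shift: the input distribution moves, the labelling rule does not.
P_{\text{train}}(X) \neq P_{\text{prod}}(X), \qquad
P_{\text{train}}(Y \mid X) = P_{\text{prod}}(Y \mid X)

% Concept drift: the labelling rule itself changes, which is the harder case.
P_{\text{train}}(Y \mid X) \neq P_{\text{prod}}(Y \mid X)
```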

Also, identifying issues in X (e.g., in a case of harmful subgroup bias) can prompt ideas for new features to collect in order to reduce the negative impact of the “bad” cases.

Alex mentioned that systematic application of the data-centric approach helps make the impact auditable and explicit, which could help teams learn, over successive iterations, what actually works for addressing production-data edge cases.

If I misunderstood or missed anything, drop a comment. Thanks for the discussion!

3 Likes