Tossing out bad examples: Real world production data distribution

Isn’t there a risk that you bias the training data away from the real-world distribution? What about cases where the bad, irregular data could reasonably be expected after deployment and the model needs to make a decision on it?

5 Likes

Aside from the real-world distribution concern, this raises the issue of characterizing unambiguously what “noisy data” means, so that such examples can be excluded in production.

2 Likes

Why not create a separate class for very noisy and very ambiguous data, thereby also teaching the model to distinguish noisy and irrelevant examples from the expected data?
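
For illustration, here is a minimal sketch of that idea, assuming a generic scikit-learn classifier and synthetic stand-in data (all names and numbers below are hypothetical): the noisy/irrelevant examples get their own label, so the model can predict it explicitly instead of being forced to assign them to a real class.

```python
# Hypothetical sketch: give noisy/irrelevant examples their own class label.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_clean = rng.normal(size=(100, 5))            # stand-in for regular examples
y_clean = rng.integers(0, 3, size=100)         # real classes 0, 1, 2
X_noisy = rng.normal(scale=5.0, size=(20, 5))  # stand-in for very noisy/ambiguous examples
y_noisy = np.full(20, 3)                       # class 3 = "noisy / irrelevant"

X = np.vstack([X_clean, X_noisy])
y = np.concatenate([y_clean, y_noisy])

clf = LogisticRegression(max_iter=1000).fit(X, y)

# At inference time, predictions of class 3 can be rejected or routed to a human.
print(clf.predict(rng.normal(scale=5.0, size=(1, 5))))
```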

8 Likes

This is a very good question. In my experience, when some data is too difficult for a human to label, the ML model will also have limited performance on those cases. In my view, the right approach here is to detect these “bad examples” with some algorithm and just output “Not sure”.
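
One common way to implement that “Not sure” behaviour is confidence thresholding. A rough sketch, assuming a trained classifier with `predict_proba` and a hand-tuned cutoff (both hypothetical):

```python
# Sketch: abstain ("Not sure") whenever the classifier's top probability is low.
import numpy as np

THRESHOLD = 0.7  # hypothetical confidence cutoff, tune on a validation set

def predict_or_abstain(clf, X):
    proba = clf.predict_proba(X)                 # shape (n_samples, n_classes)
    top_conf = proba.max(axis=1)                 # confidence of the top class
    preds = clf.classes_[proba.argmax(axis=1)]   # predicted labels
    return [p if conf >= THRESHOLD else "Not sure"
            for p, conf in zip(preds, top_conf)]
```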

3 Likes

Maybe also have a module that analyzes the data in real time and looks for distribution shifts and outliers. It should alert people to go look at the data.
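
As a rough sketch of such a module (my own illustration, not something from the event), one could compare a live window of a numeric feature against a training-time reference using a two-sample Kolmogorov–Smirnov test from scipy; the significance level and the alerting hook are placeholders.

```python
# Sketch: flag a likely distribution shift in one numeric feature.
import numpy as np
from scipy.stats import ks_2samp

def drift_alert(reference: np.ndarray, live_window: np.ndarray, alpha: float = 0.01) -> bool:
    """Return True if the live window looks significantly different from the reference."""
    _stat, p_value = ks_2samp(reference, live_window)
    return p_value < alpha  # True -> alert a human to look at the data

# Example with synthetic data standing in for training data vs. production traffic.
rng = np.random.default_rng(0)
print(drift_alert(rng.normal(0, 1, 5000), rng.normal(0.5, 1, 500)))  # likely True
```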

5 Likes

Clean, denoised data is super important. I have another question, Dillon / panel / et al.: any thoughts on using AutoML to match the best model to the data?

2 Likes

This question was answered by Dr. Andrew Ng, Dillon Laird, and Alex Ratner.
(Time Stamp 1:21:35) Event recording

2 Likes

I think a really solid example that the poster may be asking about is Google Health’s study in Thailand: Google medical researchers humbled when AI screening tool falls short in real-life testing – TechCrunch

3 Likes

Thanks for the replies! I’ll attempt a summary of my understanding.

As @nasreenpmohsin and @TodoranGeorge suggested, and as the panelists (particularly Dillon) echoed, it could be useful to engineer your system to detect these issues in production and kick off some (possibly non-ML, or even non-machine) intervention.
As Andrew mentioned, DL methods tend to be low bias, high variance, so a mismatch in the distribution of X between the training set and the test set is less of an issue as long as P(Y | X) remains consistent.
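
In other words (my own framing, not a quote from the panel), this is the usual distinction between covariate shift and concept drift:

```latex
% Covariate shift: the input distribution moves, the labelling rule does not.
P_{\text{train}}(X) \neq P_{\text{prod}}(X), \qquad
P_{\text{train}}(Y \mid X) = P_{\text{prod}}(Y \mid X)

% Concept drift: the labelling rule itself changes, which is the harder case.
P_{\text{train}}(Y \mid X) \neq P_{\text{prod}}(Y \mid X)
```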

Also, identifying issues in X (e.g., in a case of harmful subgroup bias) can prompt ideas for new features to collect in order to reduce the negative impact of the “bad” cases.

Alex mentioned that systematic application of the data-centric approach helps make the impact auditable and explicit, which could help teams learn, over successive iterations, what actually works for addressing production-data edge cases.

If I misunderstood or missed anything, drop a comment. Thanks for the discussion!

3 Likes