Removing anomalies from training data

Maxim_Kupfer · September 16, 2022, 11:29pm

Is it a good practice to remove anomalies from training set ahead of time? If so, are there any automated or recommended ways to do this?

mrafat · September 20, 2022, 12:08am

Hello @Maxim_Kupfer,

That’s a good question. I think it depends on many factors, including the type of problem you are trying to solve and the specifics of the dataset you are working on. I have previously found this article very helpful and methodical on different ways to remove outliers in your data:

Regards,

Mohamed

rmwkwok · September 20, 2022, 12:25am

Hello Maxim @Maxim_Kupfer,

If you want to train a supervised ML model to detect anomalies, then it’s not good practice to remove them. Likewise, if your model sees anomalies quite often in production and it is the model’s job to identify them, removing them may not be good practice. If those anomalies occur only in your training set and you know they are caused by, for example, wrong labels, then it can be a good idea to remove them or correct the labels.

So you see, my suggestion depends on the situation and on your understanding of the anomalies. They can be related to data quality, domain knowledge, and your model’s scope. There is no single rule for this, and there is no automated way to remove every sort of anomaly.

Sometimes people use a Gaussian assumption to filter out samples that are at the tails of the distribution for examination. Sometimes people use extreme quantiles to do the filtering. We filter out those samples to examine them and then decide what to do with them.
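To make the two filtering ideas above concrete, here is a minimal sketch using NumPy. The function names, the threshold of 3 standard deviations, and the 1%/99% quantile cut-offs are illustrative assumptions, not values from this thread. The masks only flag samples for examination; they do not drop anything.

```python
import numpy as np

def flag_by_zscore(x, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean.

    The threshold of 3.0 is an illustrative assumption.
    """
    x = np.asarray(x, dtype=float)
    std = x.std()
    if std == 0:
        # All values identical: nothing sits in a tail.
        return np.zeros(x.shape, dtype=bool)
    return np.abs(x - x.mean()) / std > threshold

def flag_by_quantile(x, lower=0.01, upper=0.99):
    """Flag values outside the [lower, upper] empirical quantiles.

    The 1% / 99% cut-offs are illustrative assumptions.
    """
    x = np.asarray(x, dtype=float)
    lo, hi = np.quantile(x, [lower, upper])
    return (x < lo) | (x > hi)

# Synthetic feature: 1000 standard-normal values plus one extreme value.
rng = np.random.default_rng(0)
values = np.append(rng.normal(0.0, 1.0, size=1000), 50.0)

z_mask = flag_by_zscore(values)
q_mask = flag_by_quantile(values)

# Look at the flagged samples before deciding what to do with them.
print("flagged by z-score: ", values[z_mask])
print("flagged by quantile:", np.sort(values[q_mask]))
```

Note that the quantile rule always flags a fixed fraction of the data, whether or not anything is actually wrong with it, which is one reason to examine the flagged samples rather than delete them blindly.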

However, filtering out for examination doesn’t necessarily mean screening out. This is particularly true when your samples have many features: the chance of a sample falling into the tail range in at least one feature becomes pretty high, and as a consequence it is not unlikely that half of your samples become “anomalous” under that simple tail rule. Therefore, I wouldn’t recommend “automatically” screening out samples, even using this “tail range” method.
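A rough way to see the effect described above: if each feature independently has probability p of landing in its tail range, then a sample with n features is flagged with probability 1 - (1 - p)^n. The sketch below assumes independent standard-normal features and a two-sided 5% tail (|z| > 1.96); both are simplifying assumptions chosen for illustration. With 14 such features, about 51% of samples get flagged.

```python
import numpy as np

def prob_any_feature_in_tail(n_features, tail_prob):
    """Probability that at least one of `n_features` independent features
    falls in its tail range, where each does so with probability `tail_prob`."""
    return 1.0 - (1.0 - tail_prob) ** n_features

# Analytic value for 14 features with a 5% two-sided tail per feature.
analytic = prob_any_feature_in_tail(n_features=14, tail_prob=0.05)

# Simulation under the same assumptions: independent standard-normal features.
rng = np.random.default_rng(0)
n_samples, n_features = 10_000, 14
X = rng.normal(size=(n_samples, n_features))

# A sample is "anomalous" if any single feature has |z| > 1.96.
flagged = (np.abs(X) > 1.96).any(axis=1)

print(f"analytic fraction flagged:  {analytic:.3f}")
print(f"simulated fraction flagged: {flagged.mean():.3f}")
```

Real features are usually correlated, so the actual fraction will differ, but the trend is the same: the more features you apply a per-feature tail rule to, the more samples it flags.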

Data scientists spend most of their time on the dataset, and working out how to identify outliers/anomalies, how and when to use them, whether and when to drop them, and so on is part of that work.

Cheers,
Raymond

Maxim_Kupfer · September 20, 2022, 10:29pm

Hi @rmwkwok, let me make it a little more tangible since I understand it is so case-by-case.

Here is what my dataset’s distribution looks like even after taking the 10th root:

Would it make sense to train my model with the HUGE outliers taken out? Otherwise, I don’t see how I can create a bell-shaped Gaussian curve without running into rounding errors.

Maxim_Kupfer · September 20, 2022, 10:32pm

Thank you, sir. If you look at my most recent reply in this thread, you’ll see a more tangible example. My data has some clear outliers that are astronomically larger than any other point. Would it make sense to remove these obvious ones so that I can have a simpler time creating a Gaussian bell curve for a feature?

rmwkwok · September 21, 2022, 12:52am

My response is here.
