Is it good practice to remove anomalies from the training set ahead of time? If so, are there any automated or recommended ways to do this?
That’s a good question. I think it depends on many factors, including the type of problem you are trying to solve and the specifics of the dataset you are working with. I have previously found this article very helpful and methodical on the different ways to remove outliers from your data:
Hello Maxim @Maxim_Kupfer,
If you want to train a supervised ML model to detect anomalies, then it’s not good practice to remove them. Likewise, if your model will see anomalies quite often in production and it is the model’s job to identify them, removing them may not be a good idea. If those anomalies occur only in your training set and you know they are caused by, for example, wrong labels, then it can be a good idea to remove them or correct the labels.
So my suggestion depends on the situation and on your understanding of the anomalies. They can be related to data quality, domain knowledge, and your model’s scope. There is no single rule for this, and there is no one-size-fits-all automated way to remove anomalies.
Sometimes people use a Gaussian assumption to filter out samples at the tails of the distribution for examination. Sometimes people use extreme quantiles to do the filtering. Either way, we filter out those samples in order to examine them and then decide what to do with them.
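To make those two filters concrete, here is a minimal sketch in Python with NumPy. The function names, the 3-standard-deviation threshold, and the 1%/99% quantile cut-offs are my own illustrative assumptions, not recommended settings. Note that it only flags samples for examination; it does not drop anything.

```python
import numpy as np

def flag_by_zscore(x, z_thresh=3.0):
    """Flag values more than z_thresh standard deviations from the mean.

    Assumes the feature is roughly Gaussian and not constant
    (a constant feature has zero standard deviation).
    """
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.abs(z) > z_thresh

def flag_by_quantile(x, lower=0.01, upper=0.99):
    """Flag values below the lower quantile or above the upper quantile."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.quantile(x, [lower, upper])
    return (x < lo) | (x > hi)

# Synthetic example: 1000 standard-normal values plus one huge value.
rng = np.random.default_rng(0)
feature = np.append(rng.normal(size=1000), [50.0])

z_mask = flag_by_zscore(feature)
q_mask = flag_by_quantile(feature)

# Look at the flagged samples before deciding anything about them.
print("flagged by z-score:", int(z_mask.sum()))
print("flagged by quantile:", int(q_mask.sum()))
```

One thing worth noticing: the quantile rule always flags a fixed share of the samples, whether or not anything is actually wrong with them, which is one more reason to inspect the flagged samples rather than delete them blindly.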
However, filtering samples out for examination does not necessarily mean screening them out (dropping them). This matters especially when your data has many features: the chance that a sample falls into the tail range of at least one feature becomes pretty high, so it is not unlikely that half of your samples become “anomalous” under that simple tail rule. Therefore, I won’t recommend “automatically” screening out samples, even with this “tail range” method.
Data scientists spend most of their time on the dataset, and thinking about how to identify outliers/anomalies, how and when to use them, whether and when to drop them, and so on is part of that work.
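A quick back-of-the-envelope check of that effect, as a sketch under a strong assumption: if the features are independent and each one flags a fraction p of the samples, the share of samples flagged in at least one of d features is 1 - (1 - p)^d. The 2% tail (outside the 1st to 99th percentile), the helper name `fraction_flagged`, and the simulated Gaussian data below are all illustrative choices of mine. Real features are usually correlated, so the actual share will differ.

```python
import numpy as np

def fraction_flagged(p_tail, n_features):
    """Expected share of samples flagged in at least one feature,
    assuming independent features that each flag a fraction p_tail."""
    return 1.0 - (1.0 - p_tail) ** n_features

# Theoretical share for a 2% per-feature tail and a growing feature count.
for d in (1, 10, 35, 100):
    print(d, "features ->", round(fraction_flagged(0.02, d), 3))

# Simulation with independent Gaussian features to sanity-check the formula.
rng = np.random.default_rng(0)
n_samples, n_features = 20000, 35
X = rng.normal(size=(n_samples, n_features))

# Per-feature 1st and 99th percentiles, then flag anything outside them.
lo, hi = np.quantile(X, [0.01, 0.99], axis=0)
in_tail = (X < lo) | (X > hi)

# A sample counts as "anomalous" if any one of its features is in a tail.
simulated = in_tail.any(axis=1).mean()
print("simulated share with 35 features:", round(float(simulated), 3))
```

Under these assumptions, roughly 35 features are already enough for about half of the samples to be flagged, even though every sample was drawn from the same well-behaved distribution.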
Hi @rmwkwok, let me make it a little more tangible since I understand it is so case-by-case.
Here is what my dataset’s distribution looks like, even after taking the 10th root:
Would it make sense to train my model with the HUGE outliers taken out? Otherwise, I don’t see how I can create a bell-shaped Gaussian curve without running into rounding errors.
Thank you sir. If you look at my most recent reply in this thread, you’ll see a more tangible example. My data has some clear outliers that are astronomically larger than any other point. Would it make sense to remove these obvious ones so that I can have an easier time creating a Gaussian bell curve for a feature?
My response is here.