Difference between outlier and anomaly

How would you differentiate an outlier from an anomaly? What I know is that an outlier is a point located far from the general distribution (mean). I don't know how to describe an anomaly or how it relates to an outlier.

Hi @tbhaxor

For example, a popular approach is to learn your normal behaviour as a "normal cluster" and, if a certain data point is too far away from this cluster, conclude that it is an anomaly.
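A minimal sketch of the "normal cluster" idea, using only the Python standard library. The helper names (`fit_normal_cluster`, `is_anomaly`), the sample points, and the choice of "mean distance plus 3 standard deviations" as the cut-off are all assumptions made for illustration; a real system would pick the distance measure and threshold to suit the data.

```python
import math
import statistics


def fit_normal_cluster(points):
    """Learn a centroid and a distance threshold from data assumed to be normal."""
    dims = len(points[0])
    centroid = [statistics.fmean(p[d] for p in points) for d in range(dims)]
    distances = [math.dist(p, centroid) for p in points]
    # Assumption: "too far away" means beyond mean distance + 3 standard deviations.
    threshold = statistics.fmean(distances) + 3 * statistics.pstdev(distances)
    return centroid, threshold


def is_anomaly(point, centroid, threshold):
    """Flag a point whose distance to the normal cluster exceeds the threshold."""
    return math.dist(point, centroid) > threshold


# Made-up 2-D points standing in for "normal behaviour".
normal_points = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9), (1.0, 1.0), (0.8, 1.2)]
centroid, threshold = fit_normal_cluster(normal_points)

print(is_anomaly((1.0, 1.0), centroid, threshold))  # False: inside the cluster
print(is_anomaly((5.0, 5.0), centroid, threshold))  # True: far from the cluster
```

The threshold is learned only from data believed to be normal, which is why a new point can be judged against it without the point itself distorting the cluster.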

Autoencoders, for example, are a popular choice for anomaly detection when you have a sufficient amount of data labelled as normal and the problem is suitable. Can you provide more details on your specific problem?

To differentiate, you could e.g. check whether the distribution assumptions are satisfied overall: if you are assuming a normal (Gaussian) distribution, all normal data should follow this distribution, including potential black swan events (I think you refer to them as statistical outliers) that occur only very rarely. After all, the normal distribution is defined over an unbounded range. Sampling a sufficiently large amount of representative data helps ensure that the true distribution is approximated acceptably.

This thread might be worth a look, too: Anomaly Detection with Different Probability Distributions - #5 by Christian_Simonis

Best regards
Christian

@tbhaxor

An anomaly is something that is not normal.

To identify an anomaly, we look at all possible values/outcomes of an activity. Values/outcomes that commonly occur are considered normal, and those that rarely occur are considered not normal, i.e. an anomaly. For a normal distribution, we are at liberty to define the boundary that separates normal from anomalous. Typically, we treat anything more than 3σ from the mean, i.e. anything outside the central 99.7% of values, as an anomaly.
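As a rough sketch of the 3σ rule, the snippet below first checks how much of a normal distribution lies within k standard deviations of the mean, then tests a new value against reference data. The function names and the reference sample are made up for illustration, and the mean and σ are estimated from the reference data only, on the assumption that it represents normal behaviour.

```python
from statistics import NormalDist, fmean, pstdev


def coverage_within(k):
    """Probability mass of a normal distribution within k sigma of its mean."""
    std_normal = NormalDist()  # mean 0, sigma 1
    return std_normal.cdf(k) - std_normal.cdf(-k)


def beyond_k_sigma(value, reference, k=3.0):
    """True if value lies more than k sigma from the mean of the reference data."""
    mean = fmean(reference)
    sigma = pstdev(reference)
    return abs(value - mean) > k * sigma


# Made-up reference sample, for illustration only.
reference = [50.0, 52.0, 55.0, 57.0, 60.0, 54.0, 53.0, 58.0]

print(round(coverage_within(3), 4))        # 0.9973
print(beyond_k_sigma(55.0, reference))     # False
print(beyond_k_sigma(50000.0, reference))  # True
```

Estimating σ from a small sample that already contains the extreme value inflates σ and can hide that value, which is one reason to fit on reference data and score new points separately.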

If the data is not normally distributed, we can still use the interquartile range (IQR) and a box plot to identify outliers.
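A small sketch of the IQR approach, which is the same rule a box plot uses for its whiskers. The factor 1.5 is the common convention rather than anything fixed, and the function name and sample data are made up for illustration.

```python
import statistics


def iqr_outliers(values, factor=1.5):
    """Return the values lying outside the [Q1 - factor*IQR, Q3 + factor*IQR] fences."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up sample with one value far from the rest.
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
print(iqr_outliers(data))  # [102]
```

Because quartiles barely move when a single extreme value is added, this rule needs no assumption about the shape of the distribution.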

It’s a judgement call based on statistics. There is no definitive answer to your question.

Yeah, it makes sense. After a lot of searching, I found that some blogs use the two terms interchangeably. There has to be some difference, otherwise it is inefficient to use two words for the same thing.

So basically, a point that deviates too much is an outlier, but some variance is required in the training data to build a general model.

Basically the difference lies in context:

An outlier is a data point that behaves differently but does not indicate anything bad. For example, employees earning a very high or very low salary.

An anomaly is when the monitoring system observes something unusual (or unexpected) that is alarming. For example, a user who usually makes 50-60 USD purchases daily by card suddenly makes a 50,000 USD purchase.

So I think that, broadly, both are the same, but the extra information we add to the problem tells us whether to call such data points anomalies or outliers. Anomaly is more about behaviour analysis, unlike outlier.

Yes, @tbhaxor, I agree!

It’s all about context here.

  • An example of a statistical outlier would be an outdoor temperature of 40 degrees Celsius (104 degrees Fahrenheit) measured in summer in Germany. Super rare, but it can happen.

  • An example of an anomaly might be an outdoor temperature of -40 degrees Celsius (-40 degrees Fahrenheit) measured in summer in Germany at the same spot. In this case, suppose your measurement device is defective: you have an anomaly, which you figure out after checking the device. Here you would not talk about a statistical outlier, since the value comes from a completely different distribution and is not representative of any temperature measured with a working device.

So this is just one example. But here it would make sense to differentiate between outlier and anomaly.

What do you think, @tbhaxor?

Best regards
Christian

In addition: in reality, exactly this difference is a big challenge. For example, in predictive diagnostics for IoT devices, it is all about judging whether a certain behaviour is still normal or borderline (even a statistical outlier) or whether it is an anomaly. Based on this judgement, technical measures can then be derived to protect users or the system, or to mitigate risks.

If you are interested in more details and approaches for tackling challenges like this: in 2020, this patent application was filed, using supervised, unsupervised and (physical) model-based approaches to determine a Residual Service Life based on a Predictive Diagnosis of Components of an Electric Drive System with AI.

Hope that helps, @tbhaxor!

Best regards
Christian