Anomaly detection: subpopulations, narrow normal distributions, and false positives


I am working on anomaly detection where we need to detect outliers on one specific KPI, so the model itself is not complicated. The data it runs on, however, is very complicated. It is telecommunications data: we monitor calls from different carriers, locations, etc. The distribution of this KPI differs strongly between carriers, locations, call directions, and so on. That is, what looks like an outlier for carrier 1 is perfectly normal for carrier 2. Hence, not only the mean but also the standard deviation of the data varies a lot between carriers, locations, etc.

My idea was to standardise within each group. For example, one can compute a z-score using the mean and standard deviation of the specific carrier, location, etc.
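As a minimal sketch of that idea (assuming pandas, and hypothetical column names `carrier` and `kpi`), the per-group z-score can be computed with a grouped transform:

```python
# Sketch: per-group z-score standardisation.
# The column names "carrier" and "kpi" are made-up examples.
import pandas as pd

df = pd.DataFrame({
    "carrier": ["A", "A", "A", "B", "B", "B"],
    "kpi":     [60.0, 58.0, 45.0, 10.0, 12.0, 11.0],
})

# Mean and std are computed within each carrier group, not globally
grouped = df.groupby("carrier")["kpi"]
df["z"] = (df["kpi"] - grouped.transform("mean")) / grouped.transform("std")
```

The same pattern extends to grouping by several keys at once, e.g. `df.groupby(["carrier", "location"])`.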

However, a different problem arises here. In some cases the distributions are very narrow, and some KPI values are wrongly flagged as outliers although they fall within a normal range. For example, take a distribution with mean 60, min 45, max 70, where I am only interested in fluctuations below the mean. In this distribution the vast majority of cases are around 60 and only a few are around 45. However, that does not make 45 an outlier in any practical sense.

This brings me to a more general question. If one trains on data whose distributions actually do not contain any outliers, how can one eliminate the vast majority of false positives (false alarms) the model produces? Do only heavily left-skewed distributions allow for proper outlier detection?
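One way to make the 45-70 example concrete (just a sketch of a possible workaround, not something established in this thread): require both a statistical deviation within the group and a minimum absolute drop before raising an alarm, so that very narrow distributions cannot trigger on tiny fluctuations. The threshold values here are made-up illustrations:

```python
# Sketch: flag a value only if it is BOTH a statistical outlier within
# its group AND far enough below the group mean in absolute KPI units.
# z_thresh=-3.0 and min_drop=20.0 are hypothetical example thresholds.
def is_alarm(value, group_mean, group_std, z_thresh=-3.0, min_drop=20.0):
    if group_std == 0:
        return False  # degenerate group: no spread, nothing to flag
    z = (value - group_mean) / group_std
    return z < z_thresh and (group_mean - value) > min_drop
```

With these thresholds, a value of 45 in a narrow distribution with mean 60 and std 3 gets a z-score of -5 but is still not flagged, because the absolute drop of 15 is below the floor.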

To summarise my two questions:

1. How would you recommend normalising data that come from different populations (e.g. carriers)?
2. How should one deal with distributions that do not contain outliers in a practical sense (are not left-skewed enough, or are too narrow)?

And just out of curiosity: I ask myself why anomaly detection based on probability distributions and densities is considered part of machine learning. I don't see where the model "learns". Rather, I am looking at samples of populations and setting a cutoff at some number of deviations. This is especially true when the training data contains no outliers. Where does the "learning" actually take place?

Many thanks


Just a couple of thoughts, because statistics are not a strong topic for me.

If you know the distribution is one-sided, then you should model it with a one-sided distribution. Not everything fits a Gaussian/normal distribution.
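For instance (a minimal sketch with synthetic data, assuming NumPy), when only drops below the mean matter, the cutoff can be taken directly from the empirical lower quantile of each group's history instead of assuming a symmetric normal distribution:

```python
# Sketch: one-sided cutoff from the empirical lower quantile.
# The data here is synthetic, roughly mimicking the mean-60/min-45/max-70
# example from the question.
import numpy as np

rng = np.random.default_rng(0)
kpi_history = rng.normal(loc=60, scale=3, size=1000).clip(45, 70)

# Flag only values below the lowest 0.1 % of the historical distribution
cutoff = np.quantile(kpi_history, 0.001)

def flag(value):
    return bool(value < cutoff)
```

Because the cutoff is empirical, a historical minimum like 45 sits at or above it and is not flagged, while values well below anything seen before are.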

If you don’t have enough data to form a reliable model, you need more data.

It’s an open question to what degree “unsupervised learning” methods are truly part of “machine learning”, as opposed to being some variety of automated statistics process.

You could say that “unsupervised learning” represents some tools you might need in developing a supervised learning algorithm.

Can you share some distribution plots and describe the problems related to them?


Hi,

Thanks for the response. My data is just extremely heterogeneous: different carriers, destinations, etc. are expected to perform very differently. Hence, it is more meaningful to run anomaly detection for each route (taking carrier, destination, etc. into account) independently.

I have tried to group the data with a K-means algorithm. These are the resulting distributions of the specific KPI value I am looking at.

However, the diversity of distributions (coming from different carriers and destinations) within each cluster is still massive. Here are some plots that all belong to cluster 0:

I guess one has to run anomaly detection for each route specifically, or set up a K-means that clusters the routes in a better way.
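One variation on the K-means idea (a sketch, assuming scikit-learn and SciPy, with made-up route names and synthetic data): instead of clustering raw KPI values, cluster routes on summary statistics of their KPI distribution (mean, std, skew), so that routes with similarly shaped distributions end up in the same cluster:

```python
# Sketch: cluster routes by distribution summary statistics rather than
# by raw KPI values. Route names and data are hypothetical.
import numpy as np
from scipy.stats import skew
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

routes = {
    "carrier1-locA": np.random.default_rng(1).normal(60, 2, 500),
    "carrier1-locB": np.random.default_rng(2).normal(60, 2, 500),
    "carrier2-locA": np.random.default_rng(3).normal(20, 8, 500),
}

# One feature vector per route: (mean, std, skew) of its KPI samples
features = np.array([[s.mean(), s.std(), skew(s)] for s in routes.values()])

# Standardise features so no single statistic dominates the distance
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(features))
```

Here the two routes with similar distributions land in one cluster and the dissimilar route in another, which is exactly the grouping a per-cluster z-score would need.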

It is a very interesting topic how heterogeneous data affects ML algorithms.

Do you know any literature on this topic ?


Hi @Victoria_Schroeder,

Which plot?

And what is the problem with the distribution being narrow?

Which plot shows normal data that are marked as outliers?