Anomaly Detection Improvement Issues

Hi, I'm applying anomaly detection to my private dataset. The code follows the Week 1 assignment. I'm finding outliers in the cross-validation set. The cross-validation set includes 280 examples, but 267 of them were flagged as anomalies, which seems unbelievable. What can I do? I remember the lectures mentioned that "feature engineering" is very important for anomaly detection, and I'm trying that now. However, I'd still like to ask the experts here: is there any other way to improve the anomaly results? I would like to make it better! Thanks.

It depends on your data set and the method you’re using to set the anomaly threshold.
You haven’t provided enough information for a more detailed answer.

Can you provide some simple statistics (mean and standard deviation, for example) for each feature in your data set? That would be a good starting point.
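For reference, per-feature statistics can be computed in one line each with NumPy (a minimal sketch; the array `X` and its values below are made up for illustration, with one row per example and one column per feature):

```python
import numpy as np

# Hypothetical data: 280 examples, 3 features, drawn here just for illustration
rng = np.random.default_rng(0)
X = rng.normal(loc=[1.0, 5.0, -2.0], scale=[0.5, 2.0, 1.0], size=(280, 3))

# Per-feature mean and standard deviation (one value per column)
mu = X.mean(axis=0)
sigma = X.std(axis=0)

for i, (m, s) in enumerate(zip(mu, sigma)):
    print(f"feature {i}: mean={m:.3f}, std={s:.3f}")
```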

Epsilon was chosen following the assignment code for selecting the best epsilon:

  • Best epsilon found using val set: 0.2384
  • Best F1 on Validation Set: 0.7105
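For context, that threshold-selection step works roughly like this (a minimal sketch, not the assignment's exact code; `p_val` holds the model's estimated densities on the validation examples and `y_val` the anomaly labels, and the toy values below are made up):

```python
import numpy as np

def select_threshold(y_val, p_val):
    """Scan candidate epsilons; keep the one with the best F1 on the val set."""
    best_epsilon, best_f1 = 0.0, 0.0
    step = (p_val.max() - p_val.min()) / 1000
    for epsilon in np.arange(p_val.min(), p_val.max(), step):
        preds = p_val < epsilon  # flag low-density examples as anomalies
        tp = np.sum((preds == 1) & (y_val == 1))
        fp = np.sum((preds == 1) & (y_val == 0))
        fn = np.sum((preds == 0) & (y_val == 1))
        if tp == 0:
            continue
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        f1 = 2 * prec * rec / (prec + rec)
        if f1 > best_f1:
            best_f1, best_epsilon = f1, epsilon
    return best_epsilon, best_f1

# Hypothetical toy data: most examples get high density, three anomalies get low
p_val = np.array([0.9, 0.8, 0.85, 0.7, 0.05, 0.02, 0.75, 0.01])
y_val = np.array([0, 0, 0, 0, 1, 1, 0, 1])
eps, f1 = select_threshold(y_val, p_val)
print(eps, f1)
```

If 267 of 280 examples fall below the chosen epsilon, the selected threshold is almost certainly too high for this data, which usually points back at the features rather than at this loop.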

One thing I forgot to mention before: all of the features are categorical, like gender (male or female), so there is no meaningful mean or standard deviation. Also, as discussed in the lecture about choosing features, categorical features can't be modeled as Gaussian. Therefore, I'm quite unsure how to choose better features. @TMosh , can you share some ways to handle this situation? Thanks!

I don’t have experience with anomaly detection for categorical features, so I did an internet search and found this paper:

Thanks @TMosh , let me have a look and learn from that. Besides, I have some data on the number of complaints each user raised. Generally, only a few users have raised a complaint, so this feature is not Gaussian, and it seems unreasonable to transform it into a Gaussian distribution. From my point of view, I should throw away this kind of feature. Am I right? Thanks.

Throwing away data is rarely a good idea.

I have transformed the data using a min-max scaler. The plot is as follows. It looks quite odd to me, and there seems to be no obvious distribution. Can you have a look and tell me what I can do? Thanks. Actually, I have several features similar to this one. If I throw all of them away, there won't be enough features left; but if I keep all of them, those features are not Gaussian. Thanks!

If those are categories, you need to one-hot code them, then see what guidance that paper has on how to handle it.
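For illustration, one-hot coding can be done by hand like this (a minimal sketch with made-up category values; in practice `pandas.get_dummies` or scikit-learn's `OneHotEncoder` do the same job):

```python
import numpy as np

def one_hot(values):
    """Map a list of category labels to a 0/1 indicator matrix."""
    categories = sorted(set(values))           # fixed column order
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=int)
    for row, v in enumerate(values):
        encoded[row, index[v]] = 1             # exactly one 1 per row
    return categories, encoded

cats, enc = one_hot(["male", "female", "female", "male"])
print(cats)  # column order
print(enc)
```

Each category becomes its own 0/1 column, so no single column is expected to look Gaussian on its own.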

The data in the diagram above are numeric… it just has a high ratio of 0.00 values.

If they are numeric data, then you can compute their statistics directly. You may need to look at a catalog of different distributions to find one that seems to fit better than a pure Gaussian.

Or you can try an internet search for ways to transform a distribution into gaussian.
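As one example of such a transform: for a count-like, zero-heavy feature, a log transform often pulls the distribution closer to Gaussian (a minimal sketch on made-up data; `log1p` handles the zeros, and scipy's Box-Cox or Yeo-Johnson transforms are more systematic alternatives):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical skewed, count-like feature (many zeros, a few larger counts)
complaints = rng.poisson(lam=0.3, size=1000).astype(float)

# log1p = log(1 + x): defined at x = 0, compresses the long right tail
transformed = np.log1p(complaints)

def skew(x):
    """Rough sample skewness, without needing scipy."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

print(f"skew before: {skew(complaints):.2f}, after: {skew(transformed):.2f}")
```

The skewness drops after the transform; whether the result is close enough to Gaussian for the density model is something to check on your own data.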

Thanks for your suggestions. It seems there are quite a few different ways to achieve one task. Many things to explore…

Hi, I’m also working on anomaly detection. I’m particularly interested in this field. If possible, could you help me understand what kind of algorithm is suitable?

Hi @Aboudramane_DIARRA

welcome to the community and thanks for your question!

There are several approaches, depending on what kind of data you have.

If you are looking for a real-world application, this patent application for an early warning system for EV batteries, which uses a variational autoencoder for predictive maintenance (and to which I was also able to contribute a bit), could be worth a look!

Please let me know if this helps!

Best regards