Anomaly Detection Improvement Issues

kaian0414 · July 3, 2023, 5:37pm

Hi, I am applying anomaly detection to my private dataset. The code follows the Week 1 assignment. I am looking for outliers in the cross-validation set. The cross-validation set includes 280 examples, but 267 of them were flagged as anomalies. That result is hard to believe. What can I do? I remember it was mentioned that “feature engineering” is very important for anomaly detection, and I am working on that now. However, I still want to ask our experts here: is there any other way to improve the anomaly results? I would like to make it better! Thanks.

TMosh · July 3, 2023, 9:10pm

It depends on your data set and the method you’re using to set the anomaly threshold.
You haven’t provided enough information for a more detailed answer.

Can you provide some simple statistics (mean and standard deviation, for example) for each feature in your data set? That would be a good starting point.
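
As a rough illustration of the per-feature statistics asked for here, the sketch below computes the mean and standard deviation of each feature column. The toy matrix `X` and the helper name `feature_stats` are made up for the example and are not taken from the poster's dataset.

```python
import statistics

def feature_stats(X):
    """Return a list of (mean, std) tuples, one per feature column of X."""
    columns = list(zip(*X))  # transpose rows into columns
    return [(statistics.fmean(col), statistics.pstdev(col)) for col in columns]

# Hypothetical data: 4 examples, 2 features (the second one is constant).
X = [
    [1.0, 10.0],
    [2.0, 10.0],
    [3.0, 10.0],
    [4.0, 10.0],
]

stats = feature_stats(X)
for j, (mu, sigma) in enumerate(stats):
    print(f"feature {j}: mean={mu:.3f}, std={sigma:.3f}")
```

A standard deviation of zero (or very close to it) points to a feature that is nearly constant, which is worth knowing before fitting any density model to it.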

kaian0414 · July 4, 2023, 1:03pm

The epsilon comes from the assignment code for selecting the best epsilon.

Best epsilon found using val set: 0.2384
Best F1 on Validation Set: 0.7105
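
For readers following along, here is a minimal sketch of an F1-based threshold scan plus a check of how many validation examples end up flagged. It assumes you already have density values `p_val` and labels `y_val` (1 = anomaly); the data and the function names are hypothetical, and this is not the assignment's code.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive (anomaly = 1) class; 0.0 if there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_threshold(y_val, p_val, steps=1000):
    """Scan candidate epsilons between min(p_val) and max(p_val), keep the best F1."""
    best_eps, best_f1 = 0.0, 0.0
    lo, hi = min(p_val), max(p_val)
    step = (hi - lo) / steps
    for i in range(1, steps + 1):
        eps = lo + i * step
        preds = [1 if p < eps else 0 for p in p_val]  # low density -> anomaly
        f1 = f1_score(y_val, preds)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

# Hypothetical validation data: four normal examples, two labelled anomalies.
p_val = [0.30, 0.28, 0.25, 0.27, 0.02, 0.01]
y_val = [0, 0, 0, 0, 1, 1]

eps, f1 = select_threshold(y_val, p_val)
flagged = sum(1 for p in p_val if p < eps)
print(f"epsilon={eps:.4f}, F1={f1:.4f}, flagged {flagged} of {len(p_val)}")
```

If the flagged count is close to the size of the whole set, as with 267 of 280 in this thread, that may indicate the density model does not fit the features well, rather than a problem with the threshold scan itself.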

One thing that I forgot to mention before: all of those features are categorical, like gender (male or female), so there is no meaningful mean or standard deviation. Also, according to the lecture about choosing features, it seems that categorical features cannot be modelled as Gaussian. Therefore, I am quite unsure how to choose better features. @TMosh , can you share some ways to handle this situation? Thanks!

TMosh · July 4, 2023, 2:06pm

I don’t have experience with anomaly detection for categorical features, so I did an internet search and found this paper:

kaian0414 · July 4, 2023, 2:16pm

Thanks @TMosh , let me have a look and learn from that. Besides, I have some data about the number of complaints that each user raised. Generally, only a few users have raised a complaint, so this feature is not Gaussian, and it seems unreasonable to transform it into a Gaussian distribution. From my point of view, I should throw away this kind of feature. Am I right? Thanks

TMosh · July 4, 2023, 2:17pm

Throwing away data is rarely a good idea.

kaian0414 · July 4, 2023, 2:23pm

I have transformed the data using a min-max scaler. The resulting plot is as follows. It looks quite odd to me, and I cannot see any obvious distribution. Can you have a look at it and tell me what I can do? Actually, I have several features similar to this one. If I throw all of them away, I will be short of features; however, if I keep all of them, those features are not Gaussian. Thanks!

TMosh · July 4, 2023, 3:00pm

If those are categories, you need to one-hot encode them, then see what guidance that paper has on how to handle it.
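
A minimal sketch of what one-hot encoding means for a single categorical column is shown here. The helper `one_hot_encode` and the sample values are hypothetical; in practice a library routine would more likely be used.

```python
def one_hot_encode(values):
    """Map a list of category labels to one-hot rows, plus the category order used."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # exactly one position is set per example
        rows.append(row)
    return rows, categories

# Hypothetical categorical feature.
gender = ["male", "female", "female", "male"]
encoded, categories = one_hot_encode(gender)
print(categories)  # ['female', 'male']
print(encoded)     # [[0, 1], [1, 0], [1, 0], [0, 1]]
```

Each resulting column is binary, so it still will not look Gaussian; the encoding only turns the labels into numbers that a method designed for categorical data can consume.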

kaian0414 · July 4, 2023, 3:01pm

The data in the diagram above are numeric… It just looks that way because of the high ratio of 0.00 values.

TMosh · July 4, 2023, 3:05pm

If they are numeric data, then you can compute their statistics directly. You may need to look at a catalog of different distributions to find one that seems to fit better than pure gaussian.

Or you can try an internet search for ways to transform a distribution into gaussian.
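
One commonly used transform for non-negative, right-skewed counts is log(1 + x). The sketch below applies it to made-up complaint counts and compares skewness before and after; the data and the `skewness` helper are assumptions for illustration only.

```python
import math
import statistics

def skewness(values):
    """Population skewness; 0 for a perfectly symmetric sample."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return 0.0
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)

# Hypothetical complaint counts: mostly zeros with a long right tail.
counts = [0, 0, 0, 0, 0, 0, 1, 1, 2, 15]
transformed = [math.log1p(c) for c in counts]  # log(1 + x), defined at x = 0

print(f"skewness before: {skewness(counts):.3f}")
print(f"skewness after:  {skewness(transformed):.3f}")
```

Note that a monotone transform like this cannot spread out a large spike of identical zeros, so a zero-heavy feature may still look far from Gaussian afterwards. In that case, fitting a different distribution to the feature may be the more promising route.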

kaian0414 · July 4, 2023, 3:32pm

Thanks for your suggestions. It seems there are many different ways to achieve one task. There are many things I need to explore…

Aboudramane_DIARRA · July 9, 2023, 9:53am

Hi, I’m also working on anomaly detection, and I’m particularly interested in this field. If possible, could you help me understand what kind of algorithm is suitable?

Christian_Simonis · July 9, 2023, 10:48am

Hi @Aboudramane_DIARRA

welcome to the community and thanks for your question!

There are several approaches, dependent on which kind of data you have.

In practice, normal data is often easy to get, while abnormalities are rare and data containing anomalies is therefore expensive to acquire. In this case an unsupervised approach is powerful: you train your model only on normal data to learn what “normal” looks like. For example, you could start with PCA, or, if you have big data, autoencoders are also a popular unsupervised solution; see also: Backpropagation algorithm - #4 by Christian_Simonis .
However, if you have rich data for both “normal” and “bad” cases, you could go for supervised learning with, e.g., a classification model like logistic regression; a Gaussian mixture model might also be worth a look. See also: What will be good machine learning algrothim for this distribution - #9 by Christian_Simonis
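
To make the "train only on normal data" idea concrete, here is a small sketch that fits a single principal component to normal examples and scores new points by reconstruction error. It is a plain-Python illustration with made-up data and hypothetical function names; a real project would more likely use a library PCA and keep more than one component.

```python
import math

def fit_first_component(rows, iters=200):
    """Fit the mean and the first principal direction via power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Covariance matrix (d x d) of the centered data.
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)]
           for a in range(d)]
    v = [1.0 / math.sqrt(d)] * d  # fixed unit start vector keeps this deterministic
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # degenerate case: no variance in the data
        v = [x / norm for x in w]
    return mean, v

def reconstruction_error(row, mean, v):
    """Distance between a point and its projection onto the fitted direction."""
    c = [x - m for x, m in zip(row, mean)]
    t = sum(ci * vi for ci, vi in zip(c, v))  # coordinate along the component
    residual = [ci - t * vi for ci, vi in zip(c, v)]
    return math.sqrt(sum(r * r for r in residual))

# Hypothetical "normal" data lying roughly on the line y = 2x.
normal = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0], [5.0, 10.1]]
mean, v = fit_first_component(normal)

err_near = reconstruction_error([3.0, 6.0], mean, v)  # follows the normal pattern
err_far = reconstruction_error([3.0, 0.0], mean, v)   # breaks the pattern
print(f"near the pattern: {err_near:.3f}, off the pattern: {err_far:.3f}")
```

Points with a large reconstruction error are the candidates for anomalies; the cut-off between "small" and "large" still has to be chosen, for example on a labelled validation set.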

If you are looking for a real-world application, this patent application for an early warning system for EV batteries, which uses a variational autoencoder for predictive maintenance and to which I was able to contribute a bit, could be worth a look!

Please let me know if this helps!

Best regards
Christian
