Anomaly algorithm - video difference

Hello, I’m writing because I’ve noticed a discrepancy, or at least what I understood to be one, between the explanations in two videos:
Video 1: Developing and evaluating anomaly detection
Video 2: Anomaly detection vs supervised learning.

Video 1 says you should choose a dataset to validate that the algorithm performs correctly, and divide it among the training, cross-validation, and test sets. It labels a good engine as y = 0 and an anomalous engine as y = 1.

But then Video 2 indicates that you should choose anomaly detection when you have many more negative examples than positive ones.

I would like to understand this more deeply, because it’s confusing when you listen to one video and then the other.

Thanks.
Regards.
Gus

Hi @gmazzaglia

The key difference between the two videos lies in the context of the datasets they refer to:

  • Video 1 explains the process of developing and evaluating an anomaly detection algorithm by splitting the dataset into training, cross-validation, and test sets, labeling a good engine as y = 0 and an anomalous engine as y = 1 (see the sketch after this list).

  • Video 2 highlights that anomaly detection is typically used when there are many more negative examples (normal cases, y = 0) than positive examples (anomalies, y = 1). This imbalance is a common scenario in anomaly detection problems.
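
To make the split from Video 1 concrete, here is a minimal NumPy sketch. The array names and example counts (10,000 normals, 20 anomalies) are my own illustrative assumptions, not numbers from the lecture:

```python
import numpy as np

# Made-up counts in the spirit of the lecture: 10,000 normal engines (y = 0)
# and only 20 known anomalous engines (y = 1).
rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(10_000, 2))  # stand-in features
X_anom = rng.normal(loc=5.0, scale=1.0, size=(20, 2))        # stand-in anomalies

# Training set: normal examples only -- the model is fit on these.
X_train = X_normal[:6_000]

# CV and test sets: the remaining normals plus half of the labeled anomalies
# each, so both sets contain a few y = 1 examples to evaluate against.
X_cv = np.vstack([X_normal[6_000:8_000], X_anom[:10]])
y_cv = np.concatenate([np.zeros(2_000), np.ones(10)])
X_test = np.vstack([X_normal[8_000:], X_anom[10:]])
y_test = np.concatenate([np.zeros(2_000), np.ones(10)])
```

The point is that the model only ever trains on normal data; the scarce labeled anomalies are reserved for the CV and test sets so you can measure how well the algorithm flags them.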

Hope it helps!

Hi @Alireza_Saei, thanks for your reply. Could you please explain it more deeply?
I mean, based on what I’ve learned, an anomaly occurs when an example falls outside the normal/Gaussian distribution; the mean and standard deviation define what “everything is OK” looks like, so anything far outside that range is an anomaly.
If I have a lot of negative examples, how can I detect an anomaly? They will all belong to the normal distribution.

Thanks.
Regards.
Gus

Hi @gmazzaglia,

That’s a good question! Anomaly detection is primarily used in scenarios where anomalies are rare compared to normal instances, which is why they are called anomalies.

Having more negative examples (normal cases) is actually beneficial because it helps you accurately model what normal looks like. The more data you have, the better you can understand the distribution of your normal data, leading to more accurate anomaly detection.
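
As a rough sketch of why the negatives help, here is a minimal per-feature Gaussian density example in NumPy. The data, the feature count, and the epsilon value are all made-up assumptions for illustration:

```python
import numpy as np

# Made-up data: 6,000 normal examples to fit on, and a CV set that mixes
# 2,000 normals with 10 injected anomalies drawn far from the normal cluster.
rng = np.random.default_rng(1)
X_train = rng.normal(0.0, 1.0, size=(6_000, 2))
X_cv = np.vstack([rng.normal(0.0, 1.0, size=(2_000, 2)),
                  rng.normal(5.0, 1.0, size=(10, 2))])

# Fit a per-feature Gaussian to the normal training data: the more negative
# examples you have, the better these estimates of "normal" become.
mu = X_train.mean(axis=0)
var = X_train.var(axis=0)

def gaussian_prob(X, mu, var):
    """p(x) as a product of independent per-feature Gaussian densities."""
    coef = 1.0 / np.sqrt(2.0 * np.pi * var)
    return np.prod(coef * np.exp(-((X - mu) ** 2) / (2.0 * var)), axis=1)

# Flag anything whose density falls below epsilon; in the course, epsilon
# is tuned on the CV set (e.g. by maximizing F1). Here it is just a guess.
p_cv = gaussian_prob(X_cv, mu, var)
epsilon = 1e-4
flagged = p_cv < epsilon  # True = predicted anomaly (y = 1)
```

The anomalies are detectable precisely because they land in the low-density tails of the distribution fitted to the negatives.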

However, if anomalies make up a significant portion of your dataset, it shifts the problem from traditional anomaly detection to a classification problem. In such cases, you would use supervised learning techniques to classify the data into normal and anomalous categories.
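
For contrast, here is a minimal sketch of the supervised alternative, assuming scikit-learn and made-up data in which anomalies are plentiful enough to learn from directly:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data where anomalies are no longer rare (300 of 800 examples),
# so there is enough signal to learn the positive class directly.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)),
               rng.normal(3.0, 1.0, size=(300, 2))])
y = np.concatenate([np.zeros(500), np.ones(300)])

# A plain supervised classifier replaces the density-threshold approach.
clf = LogisticRegression().fit(X, y)
preds = clf.predict(X[:5])  # predicted 0/1 labels for a few examples
```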

Hope this helps! Feel free to ask if you need further assistance.

Negative examples are by definition not anomalies.

Thanks, @Alireza_Saei, now I understand.

Regards.
Gus

You’re welcome, happy to help :raised_hands: