Hi team,
I have been through the 1st week of the Unsupervised Learning course, and a few questions came up:
I have a dataset of [ID, x, y] rows. These are the traces of objects on an image, identified by their IDs.
There are 60K of these IDs, and the actual valid traces are few compared to the number of outliers; think roughly 2-3% useful data points. I need to find a method to separate the outliers from the actual traces.
I started by identifying the appropriate features and transforming them with np.log to get closer to a Gaussian distribution, like here:
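Roughly something like this, where the feature names are just placeholders for the ones I actually computed:

```python
import numpy as np
import pandas as pd

# Placeholder per-ID feature table; the column names and values here are
# synthetic stand-ins, not the real features.
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "total_distance": rng.lognormal(mean=2.0, sigma=1.0, size=1000),
    "sinuosity": rng.lognormal(mean=0.5, sigma=0.7, size=1000),
})

# np.log compresses the long right tails toward a more Gaussian shape;
# np.log1p would be safer if any feature can be exactly zero.
log_features = np.log(features)
```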
But the noise in the data is quite large. For example, plotting Total Distance vs. Sinuosity reveals some clusters, but running K-means or DBSCAN is not feasible because there is still a lot of noise.
What path should I take to distinguish the outliers from the real trace data?
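For context, my clustering attempt looks roughly like this (the data here is synthetic, and eps/min_samples are just example values, not tuned ones):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Stand-in for the log-transformed [total_distance, sinuosity] features.
rng = np.random.default_rng(0)
log_features = rng.normal(size=(1000, 2))

# Standardize so both features contribute comparably to the
# Euclidean distances DBSCAN uses.
X = StandardScaler().fit_transform(log_features)

# With very noisy data, most points tend to end up in the -1 (noise)
# label, which is the problem I am describing.
labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("points labelled noise:", int((labels == -1).sum()))
```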
Anomaly detection is typically used when you have a lot of good data and a few outliers, but you say you have only 2-3% good data and the rest are outliers. How does such a thing occur, I mean, why do you have so many outliers? It would be better to have more good data to train an anomaly detection algorithm to detect the outliers. But if you think the good data you have is a good representation of the Gaussian distribution, then you can fit that Gaussian and treat whatever falls outside it as an anomaly.
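Something like this sketch, where the features, the synthetic data, and the 1% threshold are only assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
# Stand-in for the features of the known-good traces (synthetic here).
good = rng.normal(loc=[2.0, 0.5], scale=[0.3, 0.2], size=(200, 2))
# Stand-in for all the traces you want to score.
all_traces = rng.normal(loc=[2.0, 0.5], scale=[1.5, 1.0], size=(5000, 2))

# Fit a multivariate Gaussian to the good traces only.
mu = good.mean(axis=0)
cov = np.cov(good, rowvar=False)
density = multivariate_normal(mean=mu, cov=cov)

# Anything whose density falls below a chosen percentile of the
# good-data densities is flagged as an anomaly; 1% is arbitrary.
threshold = np.percentile(density.pdf(good), 1)
is_anomaly = density.pdf(all_traces) < threshold
print("flagged as anomalies:", int(is_anomaly.sum()))
```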
Or, given that you have so much imbalance, you could train another type of algorithm (a neural network, a regression model, or maybe a tree…) to tell a good data point from an outlier.
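For example, a tree ensemble with class weighting; everything below is synthetic and only meant to show the shape of the approach, and it assumes you already have good/outlier labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
# Synthetic stand-in: ~3% "good" traces (label 1), the rest outliers (label 0).
X = rng.normal(size=(6000, 2))
y = (rng.random(6000) < 0.03).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# class_weight="balanced" re-weights the rare "good" class so the
# ensemble does not simply predict "outlier" for everything.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```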
I think in this case we first need to revisit how, and on what basis, we classify something as good data, if that cluster comprises only 2-3% of the overall data.