Hi team,
I have been through the 1st week of the Unsupervised Learning course, and a few questions came up:
I have a dataset of [ID, x, y] rows. These are the traces of objects on an image, identified by their IDs.
There are 60K of these IDs, and the actual valid traces are few compared to the number of outliers; think roughly 2-3% useful data points. I need to find a method to separate the outliers from the actual traces.
I started by identifying the appropriate features and transforming them with np.log to get closer to a Gaussian distribution, like here:
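Roughly something like this, where the feature names are just placeholders for the ones I actually computed:

```python
import numpy as np
import pandas as pd

# Placeholder per-ID feature table; the column names and values here are
# synthetic stand-ins, not the real features.
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "total_distance": rng.lognormal(mean=2.0, sigma=1.0, size=1000),
    "sinuosity": rng.lognormal(mean=0.5, sigma=0.7, size=1000),
})

# np.log compresses the long right tails toward a more Gaussian shape;
# np.log1p would be safer if any feature can be exactly zero.
log_features = np.log(features)
```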
But the noise in the data is quite large. For example, plotting Total Distance vs. Sinuosity reveals some clusters, but running K-means or DBSCAN is not feasible because there is still a lot of noise.
What path should I take to distinguish the outliers from the real trace data?
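For context, my clustering attempt looks roughly like this (the data here is synthetic, and eps/min_samples are just example values, not tuned ones):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Stand-in for the log-transformed [total_distance, sinuosity] features.
rng = np.random.default_rng(0)
log_features = rng.normal(size=(1000, 2))

# Standardize so both features contribute comparably to the
# Euclidean distances DBSCAN uses.
X = StandardScaler().fit_transform(log_features)

# With very noisy data, most points tend to end up in the -1 (noise)
# label, which is the problem I am describing.
labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("points labelled noise:", int((labels == -1).sum()))
```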
Anomaly detection is typically used when you have a lot of good data and a few outliers, but you say you have only 2-3% good data and the rest are outliers. How does such a thing occur, I mean, why do you have so many outliers? It would be better to have more good data to train an anomaly detection algorithm to detect the outliers. But if you think the good data you have is a good representation of the Gaussian distribution, then you can fit that Gaussian and treat whatever falls outside it as an anomaly.
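Something like this sketch, where the features, the synthetic data, and the 1% threshold are only assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
# Stand-in for the features of the known-good traces (synthetic here).
good = rng.normal(loc=[2.0, 0.5], scale=[0.3, 0.2], size=(200, 2))
# Stand-in for all the traces you want to score.
all_traces = rng.normal(loc=[2.0, 0.5], scale=[1.5, 1.0], size=(5000, 2))

# Fit a multivariate Gaussian to the good traces only.
mu = good.mean(axis=0)
cov = np.cov(good, rowvar=False)
density = multivariate_normal(mean=mu, cov=cov)

# Anything whose density falls below a chosen percentile of the
# good-data densities is flagged as an anomaly; 1% is arbitrary.
threshold = np.percentile(density.pdf(good), 1)
is_anomaly = density.pdf(all_traces) < threshold
print("flagged as anomalies:", int(is_anomaly.sum()))
```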
Or, given that you have so much imbalance, you could train another type of algorithm (a neural network, a regression model, or maybe a tree…) to tell a good data point from an outlier.
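For example, a tree ensemble with class weighting; everything below is synthetic and only meant to show the shape of the approach, and it assumes you already have good/outlier labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
# Synthetic stand-in: ~3% "good" traces (label 1), the rest outliers (label 0).
X = rng.normal(size=(6000, 2))
y = (rng.random(6000) < 0.03).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# class_weight="balanced" re-weights the rare "good" class so the
# ensemble does not simply predict "outlier" for everything.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```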
I think in this case we first need to revisit how, and on what basis, we classify something as good data, if that cluster comprises only 2-3% of the overall data.