Finding unusual events example: Why unlabeled data

Hi dear community! As I am going through the Unsupervised Learning, Week 1 - Anomaly detection class, some examples show how to use this algorithm to detect anomalies over unlabelled data (as that is the main topic of this course).

Two examples shown are:
- Anomaly detection on flight engines
- Suspicious user activities on some internet page/portal

Is there any specific reason for this not to be Supervised Learning? I believe you could easily tag malfunctioning engines and correlate their features.
As for the user activities, you may not be able to tag suspicious behaviour “on a first iteration”, but as you move forward and collect more data, couldn’t you?

Wouldn’t this be more accurate and easier to validate in real life (not talking about the cost function here)?

Am I missing something?

Thanks a lot!

Hello @sv77,

For the sake of introducing unsupervised learning, of course we don’t attempt to make it a supervised problem.

Also, whether or not an unlabelled dataset can be converted to a labelled one is not for anyone but the dataset/project owner to say.

One point of consideration is the cost - how much additional value does labeling give us? You said it would be more accurate; then we need to ask how much more accurate. Does that improvement compensate for the additional manpower we spend on labeling?

Note that, as you said, these are unusual events. If they constitute only 0.01% of all events, would correctly separating out that 0.01% make a huge difference to the trained model?

For example, in the lecture we use a Gaussian distribution as the model, and by “training the model” we mean computing two numbers per feature: the mean and the standard deviation. Without labels, we compute these numbers over all samples. With labels, we compute them over only the normal samples. The question is, how much would the computed numbers change? If our sample size is very large, the numbers are not likely to differ by much, which means that, even though they are more accurate, they are not very different. As a result, we might spend a lot of manpower labeling the data but end up delivering only a slightly more accurate model. The question is then whether it is worthwhile to do so.
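To make that concrete, here is a minimal sketch. All the numbers in it (the 0.01% contamination rate, the feature values, the sample size) are made up for illustration; the point is only that, with a large sample, the Gaussian parameters fitted over all samples barely differ from those fitted over only the labeled-normal samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: one million "normal" readings plus 0.01% anomalies.
normal = rng.normal(loc=50.0, scale=5.0, size=1_000_000)
anomalies = rng.normal(loc=120.0, scale=10.0, size=100)  # the 0.01%
data = np.concatenate([normal, anomalies])

# Unlabeled case: fit the Gaussian over ALL samples.
mu_all, sigma_all = data.mean(), data.std()

# Labeled case: fit over only the known-normal samples.
mu_norm, sigma_norm = normal.mean(), normal.std()

print(f"all samples:    mu={mu_all:.3f}, sigma={sigma_all:.3f}")
print(f"normal only:    mu={mu_norm:.3f}, sigma={sigma_norm:.3f}")
```

Under these assumptions, the mean shifts by well under 0.1 and the standard deviation by a few percent, so the extra labeling effort buys only a marginally different model.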

Above is just one specific scenario. I believe you can give other examples where labeling is more feasible; however, at the end of the day, we need to ask the data owners why they don’t label their data, and then listen to the concerns they have. If you can come up with a feasible labeling plan, some convincing estimates of cost and return, and other promising arguments that address their concerns, then I don’t see why we wouldn’t move forward.