How does imbalanced data influence training?


As described in my previous post, I am training my own model, which detects outliers in the data in order to trigger an alarm. As one can imagine, the data is highly imbalanced (0.09% are alarms).

I have tried downsampling the number of “no alarm” examples and upsampling the number of “alarm” examples. On the train, dev, and test sets the model works OK; these sets are drawn from the more balanced data (after downsampling and/or upsampling). However, when I go into production, the model drastically overestimates the number of alarms.

Is this a form of overfitting, and hence a variance error?

How does one evaluate the performance of a model on imbalanced data, since accuracy is not really informative?

How can one make the training, dev, and test datasets more balanced without running into problems in production, where alarms are much rarer?

If I don’t do any down- or upsampling, the model predicts “no alarm” in every instance.
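To make concrete why accuracy is misleading here, a small sketch with made-up counts matching my 0.09% alarm rate:

```python
# Toy illustration: with 0.09% alarms, a model that always predicts
# "no alarm" still scores ~99.9% accuracy, but its recall is zero.
n_total = 100_000
n_alarms = 90  # 0.09% of 100,000

y_true = [1] * n_alarms + [0] * (n_total - n_alarms)
y_pred = [0] * n_total  # degenerate model: always "no alarm"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_total

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)  # fraction of real alarms that were caught

print(f"accuracy = {accuracy:.4f}")  # ~0.9991, looks great
print(f"recall   = {recall:.4f}")    # 0.0, catches no alarms at all
```

This is why metrics like precision, recall, and F1 are usually reported instead of plain accuracy for problems like this.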

In the course, imbalanced data is not really addressed. Is there a course that addresses how to deal with such data, or could you give me some advice?

Many thanks

Anomaly detection typically uses different methods. Training for accuracy or minimum cost doesn’t work for detecting outliers.

It’s covered in the MLS course.
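For a rough idea of what those methods look like, here is a minimal sketch of the Gaussian density-estimation approach to anomaly detection (one feature, made-up values and threshold, not code from the course):

```python
import math

def fit_gaussian(xs):
    """Estimate mean and variance of one feature from normal-only data."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, var

def gaussian_p(x, mu, var):
    """Probability density of x under the fitted Gaussian."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Fit on "no alarm" data only (values made up for illustration).
normal_data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
mu, var = fit_gaussian(normal_data)

# Threshold epsilon would be tuned on a dev set that contains the rare alarms.
epsilon = 1e-3

def is_anomaly(x):
    return gaussian_p(x, mu, var) < epsilon

print(is_anomaly(10.0))  # typical point -> False
print(is_anomaly(25.0))  # far outside the normal range -> True
```

The key difference from a classifier is that the model is fit only on normal data, so the extreme rarity of alarms is not a problem during training; the alarms are only needed to tune epsilon.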

Thank you for your advice. What about things like fraud detection? Fraud is usually also a very rare event. Would that also fall under anomaly detection?

It depends on how often fraud happens.

Hi, Victoria.

As Tom says, Anomaly Detection is a different type of problem. I remember that Prof Ng spent a week on it back in the original Stanford Machine Learning course. It is also covered in a more modern treatment in the new MLS specialization. I have not taken that one yet, and it’s been too long since I took the Stanford course to really remember much. But I just did a quick search on YouTube for “Andrew Ng anomaly detection” and it looks like it includes his lectures on the subject from both the original Stanford ML course and the newer MLS version. It might be worth taking a look at one of those lectures to see how he addresses the issue of anomalies being relatively rare and how that affects your training data and your choice of loss function and optimization techniques.

Thank you, Paul and Tom. I have found a course from Prof Ng on unsupervised learning, including anomaly detection.

From your experience, how does the data need to be distributed to use a deep neural net? Does it need to be around 50/50 when predicting two classes?

It’s a great question, but I’m sorry that I don’t have any real experience applying these techniques. All I know is what Prof Ng says in the various courses and even there how much I remember is a function of how long ago it was that I took the course. :grinning:

Just generally speaking, the classifiers that we’ve seen here in DLS (either binary or multiclass) using deep neural nets seem to require relatively balanced training sets. But the case of using DNNs for anomaly detection may be different. I do remember taking a look at Course 1 of the AI for Medicine Specialization (AI4M) a while ago and they did address the balance issues because the anomaly detection idea also applies when trying to detect disease in medical images. That is a supervised learning problem in that the images are labelled and those cases are using Neural Nets, so maybe you’d find some relevant ideas there as well as in the Anomaly Detection lectures you found.
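As an aside, one technique that comes up in this context as an alternative to resampling is weighting the loss so that the rare class counts more, which leaves the data distribution untouched. A minimal sketch of weighted binary cross-entropy (the weights and values here are purely illustrative, not from any of the courses):

```python
import math

def weighted_bce(y_true, y_pred, pos_weight):
    """Binary cross-entropy where positive (rare) examples count more.
    Choosing pos_weight ~ n_negative / n_positive upweights the rare class."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, 1e-12), 1 - 1e-12)  # clip to avoid log(0)
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -math.log(1 - p)
    return total / len(y_true)

# Missing the single rare positive is penalized much more heavily
# than misclassifying one of the many negatives.
y_true = [0, 0, 0, 0, 1]
y_pred = [0.1, 0.2, 0.1, 0.1, 0.3]
loss_plain    = weighted_bce(y_true, y_pred, pos_weight=1.0)
loss_weighted = weighted_bce(y_true, y_pred, pos_weight=4.0)
print(loss_plain, loss_weighted)  # weighted loss is larger
```

The appeal of this over up/downsampling is that the model still sees the true production distribution of the data, which may help with the overestimation problem described above.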

I am happy to help if I can, but have to be honest when I don’t know the answer. It would be great to hear how your research proceeds and what you learn. Being able to continue learning is a big part of what makes participating in the forum discussions rewarding.

As is often the case in machine learning, the answer is “well, that depends…”.

If you’re doing binary classification, a 50/50 split is ideal - but not strictly necessary.

But if you’re doing multiple classes (like the classic handwritten digits example), the classifier is trained using one-vs-all, and there the split for each class (compared against all the others) is 10/90. Given a large enough dataset, this seems to work fine.
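To make the 10/90 point concrete, a quick sketch of how one-vs-all turns a perfectly balanced 10-class dataset into ten imbalanced binary problems (labels made up):

```python
# A balanced 10-class dataset: 100 examples per digit class.
labels = [digit for digit in range(10) for _ in range(100)]

# One-vs-all: each digit gets its own binary classifier, where that
# digit is the positive class and all other digits are negative.
for digit in range(3):  # show the first few sub-problems
    positives = sum(1 for y in labels if y == digit)
    negatives = len(labels) - positives
    print(f"class {digit}: {positives} positive vs {negatives} negative "
          f"({positives / len(labels):.0%} positive)")

# Each binary sub-problem is 10% positive / 90% negative,
# even though the overall dataset is perfectly balanced.
```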

I’d guess that for binary classification, you could go maybe as far as 30/70 without worries. Again, it depends on the number of examples used for training.