How does imbalanced data influence training?

Hi

As described in my previous post, I am training my own model that detects outliers in the data in order to trigger an alarm. As one can imagine, the data is therefore highly imbalanced (0.09% are alarms).

I have tried downsampling the number of “no alarm” examples and upsampling the number of “alarm” examples. On the train, dev, and test sets the model works OK; those sets are taken from the more balanced data (after downsampling and/or upsampling). However, when I go into production, the model drastically overestimates the number of alarms.
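For context, this is roughly the kind of resampling I mean (just an illustrative sketch on toy data using scikit-learn’s `resample`; the column name, class ratios, and sizes here are made up, not my actual pipeline):

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Toy stand-in for the real data: roughly 0.1% of rows are alarms
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sensor": rng.normal(size=100_000),
    "alarm": (rng.random(100_000) < 0.001).astype(int),
})

alarms = df[df["alarm"] == 1]
no_alarms = df[df["alarm"] == 0]

# Downsample the "no alarm" majority and upsample the "alarm" minority (with replacement)
no_alarms_down = resample(no_alarms, replace=False,
                          n_samples=10 * len(alarms), random_state=0)
alarms_up = resample(alarms, replace=True,
                     n_samples=2 * len(alarms), random_state=0)

# Train/dev/test are then split from this balanced set,
# whose alarm rate is far higher than the 0.09% seen in production
balanced = pd.concat([no_alarms_down, alarms_up]).sample(frac=1, random_state=0)
print(balanced["alarm"].mean())
```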

Is this a form of overfitting, and hence a variance error?

How does one evaluate the performance of a model on imbalanced data, since accuracy is not really telling?

How can one make the training, dev, and test datasets more balanced without running into problems in production, where alarms are much rarer?

If I don’t do any kind of down- or upsampling, the model only predicts “no alarm” in every instance.
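To make the point about accuracy concrete with made-up numbers (a small sketch using scikit-learn metrics): at a 0.09% alarm rate, a model that always predicts “no alarm” already scores 99.9% accuracy, while its recall on alarms is zero.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 100,000 examples with a 0.09% alarm rate, and a model that always says "no alarm"
y_true = np.zeros(100_000, dtype=int)
y_true[:90] = 1                      # 90 alarms = 0.09%
y_pred = np.zeros_like(y_true)       # predicts "no alarm" everywhere

print(accuracy_score(y_true, y_pred))                    # 0.9991
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0 - misses every alarm
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
```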

In the course, imbalanced data is not really addressed. Is there a course that covers how to deal with such data, or could you give me some advice?

Many thanks
Victoria

Anomaly detection typically uses different methods. Training for accuracy or minimum cost doesn’t work for detecting outliers.

It’s covered in the MLS course.

Thank you for your advice. What about things like fraud detection? Fraud is usually also a very rare event. Would that also fall under anomaly detection?

It depends on how often fraud happens.

Hi, Victoria.

As Tom says, Anomaly Detection is a different type of problem. I remember that Prof Ng spent a week on it back in the original Stanford Machine Learning course, and it is also covered in a more modern treatment in the new MLS specialization. I have not taken that one yet, and it’s been too long since I took the Stanford course to really remember much. But I just did a quick search on YouTube for “andrew ng anomaly detection” and it looks like the results include his lectures on the subject from both the original Stanford ML course and the newer MLS version. It might be worth taking a look at one of those lectures to see how he addresses the issue of anomalies being relatively rare and how that affects your training data and your choice of loss function and optimization techniques.
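If it helps as orientation, the rough idea in those lectures (as far as I remember it; treat this as a sketch of the general approach, not the course’s actual code) is to fit a simple density model to the normal data only and then flag new examples whose estimated probability p(x) falls below a threshold epsilon:

```python
import numpy as np

def fit_gaussian(X):
    """Fit an independent Gaussian to each feature of the normal (non-anomalous) data."""
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    return mu, var

def density(X, mu, var):
    """p(x) under the per-feature Gaussian model (product over features)."""
    p = np.exp(-((X - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
    return p.prod(axis=1)

# Toy data: fit only on "normal" readings, then score a mix of normal and unusual points
rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(5000, 3))
mu, var = fit_gaussian(X_normal)

X_new = np.vstack([rng.normal(size=(5, 3)),            # typical points
                   rng.normal(loc=6.0, size=(2, 3))])  # far-out points
epsilon = 1e-4   # threshold, normally tuned on a small labelled dev set (e.g. by F1)
is_anomaly = density(X_new, mu, var) < epsilon
print(is_anomaly)
```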

Thank you, Paul and Tom. I have found a course from Prof Ng on unsupervised learning that includes anomaly detection.

From your experience, how does the data need to be distributed to use a deep neural net? Does it need to be around 50/50 when predicting two classes?

It’s a great question, but I’m sorry that I don’t have any real experience applying these techniques. All I know is what Prof Ng says in the various courses, and even there, how much I remember depends on how long ago I took the course. :grinning:

Generally speaking, the classifiers that we’ve seen here in DLS (either binary or multiclass) using deep neural nets seem to require relatively balanced training sets. But the case of using DNNs for anomaly detection may be different. I do remember taking a look at Course 1 of the AI for Medicine Specialization (AI4M) a while ago, and they did address the balance issues, because the anomaly detection idea also applies when trying to detect disease in medical images. That is a supervised learning problem, in that the images are labelled, and those cases use neural nets, so maybe you’d find some relevant ideas there as well as in the Anomaly Detection lectures you found.

I am happy to help if I can, but have to be honest when I don’t know the answer. It would be great to hear how your research proceeds and what you learn. Being able to continue learning is a big part of what makes participating in the forum discussions rewarding.

As is often the case in machine learning, the answer is “well, that depends…”.

If you’re doing binary classification, a 50/50 split is ideal - but not strictly necessary.

But if you’re doing multiple classes (like the classic handwritten digits example), the model is trained using one-vs-all, and there the split for each class (compared against all the others) is 10/90. Given a large enough dataset, this seems to work fine.
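For instance (a quick illustrative sketch with scikit-learn, not something from the course), one-vs-rest on the classic digits dataset works fine even though each binary sub-problem sees roughly a 10/90 split:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Handwritten digits: 10 classes, so each one-vs-rest sub-problem is roughly 10/90
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # high accuracy despite the per-class imbalance
```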

I’d guess that for binary classification, you could go maybe as far as 30/70 without worries. Again, it depends on the number of examples used for training.