How to improve F1 score in an imbalanced dataset?

endrit.dosti · September 12, 2021, 10:50am

Hi,

I wanted to ask for some suggestions about a binary classification task that I am facing at work. The biggest problem is that the dataset is imbalanced. More specifically, it has 9 features and a total of 13,000 data points, and the imbalance ratio between the classes is a little worse than 1:10. The issue I am having is a low F1 score.

So far, I have tried the following:

Tested different types of models (e.g. logistic regression, ridge, SVM, XGBoost, random forest).
Used the random undersampling/oversampling algorithms in the imbalanced-learn library.
Augmented the minority class, in the training set only, with additional data points from another dataset, and checked performance on the test set (which contains points only from my original dataset); i.e. I changed the training distribution slightly.
Changed the class weights, i.e. gave higher weight to the minority-class points so that the model would “understand” that those points are important.
Tried outlier detection techniques (isolation forest, one-class SVM, local outlier factor, etc.).
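
To make the resampling step concrete, here is a minimal, dependency-free sketch of what random undersampling does. It is only an illustration: the function name `random_undersample` and the toy data are made up for this post, and in practice I used imbalanced-learn (its `RandomUnderSampler`) rather than hand-written code. The toy data only mimics the shape of my problem (9 features, roughly 1:10 ratio); it is not my real dataset.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly drop samples so that every class shrinks to the size of the rarest class."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)

    # Keep as many samples per class as the smallest class has.
    n_min = min(len(idxs) for idxs in by_class.values())
    keep = []
    for idxs in by_class.values():
        keep.extend(rng.sample(idxs, n_min))
    keep.sort()  # preserve the original sample order

    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: 100 minority points, 1000 majority points, 9 features each.
data_rng = random.Random(42)
y = [1] * 100 + [0] * 1000
X = [[data_rng.gauss(label, 1.0) for _ in range(9)] for label in y]

X_res, y_res = random_undersample(X, y)
print("before:", Counter(y))
print("after: ", Counter(y_res))
```

I applied the resampling to the training split only, so that the test set keeps the original class ratio.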

The best performance I have managed to get is around 80% accuracy and a 20% F1 score, using a regularized SVM. I am happy with the accuracy; however, the F1 score is still really low. Can someone suggest more things that I can try?
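
In case it helps to see how the two metrics relate, here is a small sketch that computes accuracy and F1 from confusion-matrix counts. The counts are hypothetical, chosen only to roughly match a 1:10 class ratio with about 80% accuracy and about 20% F1; they are not my actual confusion matrix. The helper names are also just for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for the positive (minority) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical counts: 100 positives and 1000 negatives in the test set.
tp, fp, fn, tn = 27, 147, 73, 853

acc = accuracy(tp, fp, fn, tn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"accuracy={acc:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Accuracy is dominated by the many correctly classified majority points, while F1 only looks at the minority class, which is why the two numbers can be so far apart.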

Thanks in advance! I am very grateful for the help.
Best wishes,
Endrit
