In the course AI for Medical Diagnosis, the Sampling video explains that validation and test sets should be balanced 50-50 between class 0 and class 1 so that the model's performance can be assessed. But one question arises: the test set is supposed to reflect reality, so shouldn't it be imbalanced, while the training set is balanced to help the model distinguish between the classes? Is that approach correct?
Hello @Felix_Enriquez
Can you share the timestamp or video link where this is stated? As far as I know, having the same ratio between the splits for the test and validation sets is correct, but including an equal 50% of both 0 and 1 cases (that is, having the disease versus not having the disease) is not mandatory. Only the features of the dataset need to match; for example, with chest X-rays, all the images should contain chest examination features relevant to the desired model analysis, such as pneumothorax, COPD, or pleural effusion.
Regards
DP
Firstly, it’s important to avoid relying solely on Accuracy as a metric when dealing with imbalanced datasets. Accuracy can give a distorted view of a model’s performance in such cases. Instead, metrics like Precision, Recall, or the F1-score provide a more nuanced understanding of its effectiveness. Often, a balance between Precision and Recall is sought after, but the ideal balance depends on the specific application.
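To make that pitfall concrete, here is a minimal sketch (not from the course; the labels and the "model" are made up for illustration) of how accuracy can look excellent on an imbalanced test set while Precision, Recall, and F1 reveal the model is useless:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.05).astype(int)  # ~5% positive cases (imbalanced)
y_pred = np.zeros_like(y_true)                  # a "model" that always predicts class 0

print("accuracy :", accuracy_score(y_true, y_pred))                    # ~0.95, looks great
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("recall   :", recall_score(y_true, y_pred, zero_division=0))     # 0.0, misses every patient
print("f1       :", f1_score(y_true, y_pred, zero_division=0))         # 0.0
```

The ~95% accuracy comes entirely from the majority class; Recall of zero shows the model never identifies a diseased patient.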
Regarding your query, while it’s beneficial to test your model against a distribution that mirrors reality, it’s crucial to ensure adequate representation of the minority class within the test set. The samples for the minority class in the test set should encompass a diversity comparable to what was present during training and validation. Balancing this representation with the natural occurrence of classes can pose a challenge, leading to test sets that may not always reflect the real-world distribution of classes.
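As a small illustration of one way to handle this (my own sketch, with a made-up feature matrix, not the course's procedure), a stratified split preserves the real-world class ratio in every split while still guaranteeing that the minority class appears in the test set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))            # hypothetical feature matrix
y = (rng.random(1000) < 0.05).astype(int)  # ~5% positive cases

# stratify=y keeps the ~5% positive rate in both train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print("positive rate in train:", y_train.mean())  # ~0.05
print("positive rate in test :", y_test.mean())   # ~0.05
```

If instead you want a balanced test set as described in the video, you would subsample the majority class in the test split, accepting that it no longer mirrors the real-world prevalence.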