Sampling strategy in case of imbalanced data

In the Sampling video of the AI for Medical Diagnosis course, it is explained that the validation and test sets should be balanced, with a 50-50 split between class 0 and class 1, so that the performance of the model can be assessed. But one question arises: the test set is supposed to reflect reality, so shouldn't it be imbalanced, while the train set is balanced to help the model distinguish between the classes? Is that approach correct?

Hello @Felix_Enriquez

Can you share the timestamp or video link where this is mentioned? As far as I know, giving the test and validation sets an equal number of cases (or a fixed ratio between the splits) is correct, but a 50-50 split between class 0 and class 1, that is, between not having and having the disease, is not mandatory. What does need to match is the features of the dataset. With chest X-rays, for example, every image should be a chest examination relevant to what the model is meant to analyse, such as pneumothorax, COPD, or pleural effusion.


Firstly, it’s important to avoid relying solely on Accuracy as a metric when dealing with imbalanced datasets. Accuracy can give a distorted view of a model’s performance in such cases. Instead, metrics like Precision, Recall, or the F1-score provide a more nuanced understanding of its effectiveness. Often, a balance between Precision and Recall is sought after, but the ideal balance depends on the specific application.
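
To make the point about Accuracy concrete, here is a minimal sketch in plain Python. The 95/5 class split and the "always predict negative" model are made-up assumptions for illustration, not something from the course, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = disease)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1; undefined ratios fall back to 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Assumed toy data: 5 positive cases, 95 negative cases.
y_true = [1] * 5 + [0] * 95
# Assumed degenerate model that never predicts the disease.
y_pred = [0] * 100

result = metrics(y_true, y_pred)
print(result)
```

The model never finds a single positive case, yet its Accuracy is 95%, while Recall and F1 are both zero. That gap is the distortion described above.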

Regarding your query, while it’s beneficial to test your model against a distribution that mirrors reality, it’s crucial to ensure adequate representation of the minority class within the test set. The samples for the minority class in the test set should encompass a diversity comparable to what was present during training and validation. Balancing this representation with the natural occurrence of classes can pose a challenge, leading to test sets that may not always reflect the real-world distribution of classes.
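
One way to see what is at stake in that trade-off: Recall (sensitivity) and specificity do not depend on the class ratio of the test set, but Precision does. The sketch below uses Bayes' rule to estimate Precision at a given prevalence. The 0.9 sensitivity, 0.9 specificity and 5% prevalence are illustrative assumptions only, and `precision_at_prevalence` is a hypothetical helper name.

```python
def precision_at_prevalence(sensitivity, specificity, prevalence):
    """Expected precision (PPV) for a classifier at a given disease prevalence."""
    true_pos = sensitivity * prevalence               # fraction of all cases that are true positives
    false_pos = (1 - specificity) * (1 - prevalence)  # fraction of all cases that are false positives
    if true_pos + false_pos == 0:
        return 0.0
    return true_pos / (true_pos + false_pos)


# Assumed classifier quality, identical in both scenarios.
sensitivity, specificity = 0.9, 0.9

balanced = precision_at_prevalence(sensitivity, specificity, 0.5)    # 50-50 test set
realistic = precision_at_prevalence(sensitivity, specificity, 0.05)  # assumed 5% prevalence

print(f"precision on balanced test set:  {balanced:.3f}")
print(f"precision at assumed prevalence: {realistic:.3f}")
```

Under these assumptions the same model looks far more precise on the balanced test set than it would at the lower prevalence. So if the test set is balanced to get enough minority-class samples, it may help to report Recall and specificity directly and to re-estimate Precision for the prevalence you expect in practice.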