Help with class imbalance in mammogram dataset when fine-tuning EfficientNet?

I’m working on a binary classification problem for breast cancer detection using mammogram images. I’m fine-tuning a pretrained EfficientNet model.

Dataset distribution:

Train set:

  • Benign: 1569

  • Malignant: 803

Validation set:

  • Benign: 448

  • Malignant: 66

Test set:

  • Benign: 208

  • Malignant: 128

The dataset is moderately imbalanced overall (~2.2:1), but the validation set is heavily skewed toward the majority class.
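
For reference, the per-split ratios can be checked with a few lines of Python. This is just arithmetic on the counts listed above; the dictionary layout and the helper name `imbalance_ratio` are made up for illustration. It gives roughly 1.95:1 for train, 6.8:1 for validation, and 1.6:1 for test.

```python
# Class counts copied from the dataset distribution listed above.
counts = {
    "train": {"benign": 1569, "malignant": 803},
    "val": {"benign": 448, "malignant": 66},
    "test": {"benign": 208, "malignant": 128},
}


def imbalance_ratio(split):
    """Benign-to-malignant ratio for one split (illustrative helper)."""
    return split["benign"] / split["malignant"]


# Ratio per split.
ratios = {name: imbalance_ratio(split) for name, split in counts.items()}

# Ratio over all three splits combined.
total_benign = sum(s["benign"] for s in counts.values())
total_malignant = sum(s["malignant"] for s in counts.values())
overall_ratio = total_benign / total_malignant

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}:1")
print(f"overall: {overall_ratio:.2f}:1")
```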

What I’ve tried / considered

Data augmentation for minority class

My main concerns:

  • What is the most reliable approach for handling this type of imbalance in medical image classification?

  • Should the validation set be rebalanced, or should it reflect real-world distribution?

  • Which metrics should I prioritize for this problem (recall, AUC, F1, etc.)?

My main goal is to maximize malignant class detection (recall), since false negatives are critical in this case.

What I’m looking for:

Best practices or recommended training strategy for handling this kind of imbalance when fine-tuning EfficientNet on medical imaging data.

You’re thinking about the problem in the right way.

Data augmentation is always a good practice, especially for image data, and it can help with imbalance by enriching the minority class.

For medical classification problems like this, the key is not to rely on accuracy. You should focus on precision, recall, and F1-score, especially recall for the malignant class since false negatives are critical.
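
To see why accuracy is misleading here, a rough sketch in plain Python: it computes accuracy, precision, recall, and F1 for a hypothetical model that labels every validation image as benign. The function name `binary_metrics` and the label encoding (1 = malignant, 0 = benign) are assumptions for illustration; only the 448/66 validation counts come from the original post.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Validation set from the post: 448 benign (0), 66 malignant (1).
y_true = [0] * 448 + [1] * 66
# Hypothetical degenerate model that always predicts benign.
y_pred = [0] * len(y_true)

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

The always-benign model reaches about 87% accuracy on that validation set while its malignant recall is zero, which is exactly the failure that accuracy hides.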

This topic is also covered in the Machine Learning Specialization—there’s a unit on evaluation metrics for imbalanced datasets that’s worth revisiting.

Overall: judge the model on the metrics that match your goal (malignant recall first), not on accuracy alone.

Have you seen this?

This excerpt in the weighted initialization section of the tutorial linked above seems particularly relevant to the OP’s objective…

The default threshold of t = 50% corresponds to equal costs of false negatives and false positives. In the case of fraud detection, however, you would likely associate higher costs to false negatives than to false positives.
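
Applied to the OP's case, the same reasoning suggests lowering the decision threshold for the malignant class. Below is a minimal sketch, assuming the model outputs reasonably calibrated probabilities and that you can put rough costs on each error type. Under those assumptions, predicting positive minimizes expected cost when p exceeds cost_fp / (cost_fp + cost_fn), which reduces to 0.5 for equal costs. The 5:1 cost ratio, the scores, and the function names are all hypothetical.

```python
def cost_based_threshold(cost_fp, cost_fn):
    """Threshold that minimizes expected cost, assuming calibrated probabilities."""
    return cost_fp / (cost_fp + cost_fn)


def classify(probs, threshold):
    """Label a case malignant (1) when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]


def recall(y_true, y_pred):
    """Fraction of true malignant cases that were flagged."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical labels (1 = malignant) and predicted malignant probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.90, 0.40, 0.20, 0.30, 0.10, 0.05, 0.60, 0.15]

default_t = cost_based_threshold(cost_fp=1, cost_fn=1)  # equal costs
lowered_t = cost_based_threshold(cost_fp=1, cost_fn=5)  # assumed 5:1 cost ratio

recall_default = recall(y_true, classify(probs, default_t))
recall_lowered = recall(y_true, classify(probs, lowered_t))
print(default_t, recall_default)
print(lowered_t, recall_lowered)
```

On this toy data the lower threshold catches more malignant cases at the price of more false positives, so in practice the threshold is best chosen on the validation set against a recall target.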

Related to that is the idea of applying class weights directly in the loss function. This seems to have been explored in several medical imaging studies that you can find online. For example
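
To make the loss-weighting idea concrete, here is a generic sketch in plain Python (it is not taken from any particular study). It assumes the common inverse-frequency heuristic, weight = total / (n_classes * class_count), applied to the training counts from the original post, and plugs those weights into a binary cross-entropy. The function names are hypothetical; in a real training loop you would normally use your framework's built-in option instead, such as `pos_weight` in PyTorch's `BCEWithLogitsLoss` or `class_weight` in Keras `fit`.

```python
import math


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


def weighted_bce(y_true, probs, w_neg, w_pos, eps=1e-7):
    """Mean binary cross-entropy with a separate weight per class."""
    losses = []
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        if y == 1:
            losses.append(-w_pos * math.log(p))
        else:
            losses.append(-w_neg * math.log(1 - p))
    return sum(losses) / len(losses)


# Training counts from the original post.
train_counts = {"benign": 1569, "malignant": 803}
weights = balanced_class_weights(train_counts)

# Compare two equally confident mistakes: a missed malignant case
# versus a false alarm on a benign case.
missed_malignant = weighted_bce([1], [0.1], weights["benign"], weights["malignant"])
false_alarm = weighted_bce([0], [0.9], weights["benign"], weights["malignant"])
print(weights)
print(missed_malignant, false_alarm)
```

With these counts the malignant weight comes out at about 1.48 and the benign weight at about 0.76, so a missed malignant case is penalized roughly twice as heavily as an equally confident false alarm.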