Trouble Understanding Class Imbalance (CI)

Hi @huanvo, @pdarling,

To add a more conceptual view: CI is a typical machine learning challenge that comes up when developing and deploying algorithms. With a data-centric approach to training, we want our algorithms to generalize well so that they add value in the real world.
Thus, it is a must to assess the data for imbalance and to address it during training.

Take for instance a typical cancer prediction test where ~99% of the labels are negative. If you don’t address this CI during training, your algorithm in production may reach 99% accuracy simply by always predicting the negative outcome, while never detecting a single positive case. The same applies to problems with multiple classes. For instance, in image segmentation for computer vision, you may find that most of the images in your dataset belong to one particular class, such as “background / typical environment” (in satellite pictures) or “empty road” (in footage recorded during a trip in a self-driving car).
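To make the cancer example concrete, here is a minimal sketch in plain Python. The dataset size (1,000 patients, 10 positives) and the helper names `accuracy`, `recall`, and `balanced_class_weights` are my own assumptions for illustration, not something from the course material. It shows how a model that always predicts negative gets high accuracy but zero recall, and it computes one common weighting heuristic (total samples divided by number of classes times class count) that can be used to make the rare class count more during training.

```python
# Hypothetical toy dataset: 1,000 patients, 1% positive (made-up numbers).
n_total = 1000
n_positive = 10
y_true = [1] * n_positive + [0] * (n_total - n_positive)

# A "model" that ignores its input and always predicts negative (0).
y_pred = [0] * n_total


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model found."""
    true_pos = sum(t == positive and p == positive
                   for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


def balanced_class_weights(y):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = {}
    for label in y:
        counts[label] = counts.get(label, 0) + 1
    n_classes = len(counts)
    return {label: len(y) / (n_classes * count)
            for label, count in counts.items()}


acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
weights = balanced_class_weights(y_true)

print(f"accuracy: {acc:.2f}")   # looks great on paper
print(f"recall:   {rec:.2f}")   # but no positive case is ever found
print(f"weights:  {weights}")   # the rare class gets a much larger weight
```

The point is that accuracy alone hides the problem, so it helps to also look at a per-class metric such as recall. Class weighting is only one possible remedy; resampling the data or collecting more examples of the rare class are other options, and which one fits best depends on your problem.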

I hope that also helps!