In the lecture Resampling to Achieve Balanced Classes, the instructor seems to suggest that we can simply duplicate training set entries for the under-represented classes. That can't possibly be the whole answer, can it? Literally duplicating data may balance the classes, but it does not add new information, and I would think the model can't really learn anything new from additional copies of data that is already present. Maybe if we applied data augmentation to the under-represented classes, we would at least not be adding exact duplicates of existing data.
I’m hoping more will be said on this topic as we continue through Week 1.
There are other ways to balance the dataset, such as weighting the loss by class or creating synthetic images, but duplicating data is not that bad, and it is pretty common in practice. If you think about it, every iteration modifies the gradients only slightly, so repeating data is not terrible; in fact, the idea of training for multiple epochs is exactly that. Moreover, gradient descent was originally formulated to use all of the data (and therefore the same data) at every iteration; stochastic gradient descent is just an approximation, adopted largely because of memory constraints. In any case, it is common to add some augmentation to the images so that they are a bit different every time they are seen.
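As a rough sketch of two of the alternatives mentioned above (class weighting and light augmentation), assuming TensorFlow/Keras and a hypothetical 1:9 class imbalance, something like this is typical; the variable names and ratios are made up for illustration:

```python
import tensorflow as tf

# Suppose class 1 is under-represented roughly 1:9 relative to class 0.
# Weighting each class inversely to its frequency makes the loss treat
# both classes as equally important.
class_weight = {0: 1.0, 1: 9.0}

# Light augmentation so repeated minority-class images are not exact
# duplicates from epoch to epoch.
augment = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=10,
    width_shift_range=0.05,
    height_shift_range=0.05,
    horizontal_flip=True,
)

# model, x_train, y_train are assumed to already exist:
# model.fit(augment.flow(x_train, y_train, batch_size=32),
#           class_weight=class_weight,
#           epochs=10)
```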
One more thing: a better way to achieve the same effect without explicitly duplicating the data is to create a generator that samples uniformly across the different classes when building each batch.
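A minimal sketch of such a generator, in plain NumPy (the function and variable names are hypothetical): each slot in a batch first picks a class uniformly at random and then picks a random example from that class, so every batch is balanced in expectation without storing any duplicates.

```python
import numpy as np

def balanced_batch_generator(x, y, batch_size=32, seed=0):
    """Yield batches whose examples are drawn by sampling classes uniformly."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    # Pre-compute the indices belonging to each class.
    idx_by_class = {c: np.flatnonzero(y == c) for c in classes}
    while True:
        # Pick a class uniformly at random for each slot in the batch...
        chosen_classes = rng.choice(classes, size=batch_size)
        # ...then pick a random example from that class.
        batch_idx = np.array(
            [rng.choice(idx_by_class[c]) for c in chosen_classes]
        )
        yield x[batch_idx], y[batch_idx]

# Usage (x_train, y_train assumed to exist):
# gen = balanced_batch_generator(x_train, y_train, batch_size=32)
# xb, yb = next(gen)
```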
Hope it helps.