Resampling to address dataset imbalance

In the lecture Resampling to Achieve Balanced Classes, the instructor seems to suggest that we can simply duplicate training set entries for the under-represented classes. That can’t possibly be the whole answer, can it? Literally duplicating data may balance the classes, but it does not add new data and I would think the training can’t really learn anything new from the additional copies of data that is already present. Maybe if we applied data augmentation to the under-represented classes then you’d at least not be adding exact duplicates of existing data.

I’m hoping more will be said on this topic as we continue through Week 1.


There are other ways to balance the dataset: weighting the loss by class label, creating synthetic images, and so on. But duplicating is not that bad, and it is pretty common. If you think about it, every iteration only changes the weights slightly, so seeing the same example again is not wasted. In fact, training for multiple epochs does exactly that. Moreover, gradient descent was originally designed to use all of the data (and therefore the same data) at every iteration; stochastic gradient descent is just an approximation of it, used largely because of memory and compute constraints. Anyhow, it is common to add some augmentation to the images, so that the duplicates look a bit different every time.
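To make the "weighting by labels" option concrete, here is a minimal sketch of one common heuristic: weight each class inversely to its frequency. The function name `inverse_frequency_weights` and the 90/10 toy labels are made up for illustration; the course may use a different scheme.

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Return {class: weight}, where weight = n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Toy imbalanced label vector (assumed for illustration): 90 negatives, 10 positives.
labels = np.array([0] * 90 + [1] * 10)
class_weights = inverse_frequency_weights(labels)
print(class_weights)
```

The rare class ends up with the larger weight, so each of its examples contributes more to the loss. In effect this is similar to duplicating those examples, but without storing any copies.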

One more thing: a better way to get the same effect without explicitly duplicating the data is to create a generator that samples the classes uniformly, so every class is equally likely to show up in a batch.
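A rough sketch of such a generator, using NumPy only. The name `balanced_batches`, the toy arrays, and the choice to sample with replacement are my own assumptions for illustration, not something taken from the lecture.

```python
import numpy as np

def balanced_batches(features, labels, batch_size, seed=0):
    """Yield (features, labels) batches in which every class is equally likely."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    # Pre-compute the row indices that belong to each class.
    indices_by_class = {c: np.flatnonzero(labels == c) for c in classes}
    while True:
        # For every slot in the batch, pick a class uniformly at random,
        # then pick a random example of that class (with replacement).
        chosen_classes = rng.choice(classes, size=batch_size)
        batch_idx = np.array([rng.choice(indices_by_class[c]) for c in chosen_classes])
        yield features[batch_idx], labels[batch_idx]

# Toy data (assumed): 100 examples, only 10 of them in class 1.
features = np.arange(100).reshape(100, 1)
labels = np.array([0] * 90 + [1] * 10)

generator = balanced_batches(features, labels, batch_size=32)
batch_x, batch_y = next(generator)
print(batch_x.shape, batch_y.mean())
```

Rare examples are still seen repeatedly, so this is oversampling in disguise, but nothing is copied on disk and you can apply augmentation to each batch as it is drawn.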

Hope it helps.