C1 W2 Lab: Balance the dataset

Hi,

In Course 1, Week 2, the lab uses the following to make a balanced set:

df_grouped_by = df.groupby(['product_category', 'sentiment'])
df_balanced = df_grouped_by.apply(lambda x: x.sample(df_grouped_by.size().min()).reset_index(drop=True))

Here it picks 9 (the minimum group size) items for each (product_category, sentiment) group.
Is this the way we should balance a dataset in real life too, or is it given in the lab exercise just to make sure there is the same number of samples per class for each product_category?

Is there a better way to balance the classes? And if the dataset is very large and heavily skewed, how do we balance it then?

Thanks,
Priyabrat

Hi @Priyabrat_Bishwal,

I was planning to engage in this discussion with you, but some urgent matters at work came up. I’m sorry for the delay!

So, the way I see it, in real life you can pursue either alternative (assuming you have data centricity in your soul :wink: ):

  1. balancing the data: going through this process helps ensure you can use simple baseline models such as Naive Bayes (the relative frequency of classes affects such a model). I see this approach as the more rigorous one.
  2. leaving it unbalanced: you can approach class imbalance in several ways, some of them being: artificially generating more data (e.g., in computer vision: image rotation, mirroring, etc.; see the sketch after this list), choosing your train/dev/test distributions carefully (dev and test with the same distribution, reflecting the data you expect to get in the future), orthogonalization, sizing the dev and test sets appropriately (depending on how much data you have in hand), etc.
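To make the augmentation idea concrete, here is a minimal sketch with NumPy; the helper name, array shapes, and the set of transforms are illustrative assumptions, not anything from the lab:

import numpy as np

def augment_image(img):
    """Hypothetical helper: create simple variants of one image.

    Assumes img is an (H, W, C) NumPy array; returns mirrored and
    rotated copies to enlarge an under-represented class.
    """
    return [
        np.fliplr(img),      # horizontal mirror
        np.flipud(img),      # vertical mirror
        np.rot90(img, k=1),  # rotate 90 degrees counter-clockwise
        np.rot90(img, k=2),  # rotate 180 degrees
    ]

# Toy usage: one fake 4x4 RGB image -> four augmented copies
image = np.random.rand(4, 4, 3)
augmented = augment_image(image)
print(len(augmented))  # 4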

I hope this helps! :slight_smile:

Cheers!

Hi Priyabrat_Bishwal,

Just to add to Raul’s answer: for artificially generating more data for an under-represented class, you could look at techniques like SMOTE. I have also seen some folks use GANs for this, but I wouldn’t recommend GANs unless you have a lot of compute resources at your disposal; training a GAN can be very computationally expensive.
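As a rough illustration (assuming the imbalanced-learn package and a purely synthetic toy dataset, neither of which is part of the lab), SMOTE oversampling might look like:

from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy dataset with roughly a 9:1 class imbalance
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))  # majority class far outnumbers minority

# SMOTE synthesizes new minority-class samples by interpolating
# between existing minority samples and their nearest neighbors
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X, y)
print(Counter(y_resampled))  # both classes now have equal counts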

Also, a lot of the algorithms in scikit-learn let you specify class weights via the class_weight parameter to address class imbalance. For example, the class_weight parameter in RandomForestClassifier.
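A minimal sketch of that (the synthetic feature matrix X and labels y are assumptions for illustration, not lab variables):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# 'balanced' reweights each class inversely proportional to its
# frequency, so minority-class errors cost more during training
clf = RandomForestClassifier(class_weight='balanced', random_state=0)
clf.fit(X, y)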


.size() gives the size of each group in df_grouped_by. Print it to see. If you also append .min() to this line, it gets the minimum of those sizes, which is 9, corresponding to the Layering group. In order to have a balanced dataset, you have to go with the minimum. Why? When you only have 9 samples of one group (here Layering with sentiment -1), you can’t increase the number of your samples unless you use ML techniques like SMOTE to artificially generate more, but you CAN decrease the number of samples of the other groups by choosing fewer of them. In fact, the least frequent group sets the threshold for a balanced dataset.

Take a look at the output of these 2 lines and you will understand.

df[df['product_category']=='Layering'].shape 
df[(df['product_category']=='Layering') & (df['sentiment']==0)].shape
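If you want to see the whole mechanism end to end, here is a toy example you can run (the category and sentiment values are made up, not the lab's data):

import pandas as pd

# Toy unbalanced dataset: each 'A' group has 2 rows, each 'B' group has 1
df = pd.DataFrame({
    'product_category': ['A', 'A', 'A', 'A', 'B', 'B'],
    'sentiment':        [ 1,   1,  -1,  -1,   1,  -1],
})

grouped = df.groupby(['product_category', 'sentiment'])
print(grouped.size())        # per-group counts; the smallest is 1
print(grouped.size().min())  # 1 -> every group gets downsampled to 1 row

balanced = grouped.apply(
    lambda x: x.sample(grouped.size().min()).reset_index(drop=True)
)
print(balanced)  # 4 rows: one per (product_category, sentiment) group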