Can someone explain the code for balancing the data in the C1 W2 lab, please?

In the C1 W2 lab "Detect data bias with Amazon SageMaker Clarify", there are two lines that balance the dataset:

df_grouped_by = df.groupby(['product_category', 'sentiment'])
df_balanced = df_grouped_by.apply(lambda x: x.sample(df_grouped_by.size().min()).reset_index(drop=True))

Can someone explain why the second line can balance the data please? Many thanks.

You are basically dropping rows from each (product_category, sentiment) group until every group is down to the size of the least represented one. After that we have a balanced df, because every group has equal representation. Hope this is clear (and if not, please let me know) :slight_smile: Thanks for posting your question!

I have the same question. Can you explain a little more about which index it is trying to reset? Is it product_category and sentiment? So basically it works by dropping rows until all groups reach the size of the group with the minimum count. I am not sure if I understand it correctly.

It is not exactly dropping the rows. df.sample() takes a random sample of rows from the df, with the size given either as a fraction (frac) or as an integer (n). Here the argument inside sample() is “df_grouped_by.size().min()”, which tells us the sample size is as big as the smallest group - this is where the magic happens. That ensures that each group ends up the same size, effectively balancing the dataset. I’m a little rusty on using a lambda on a groupby object, but I’m 99% sure that apply() calls it once per group. As for the index: reset_index(drop=True) just discards each sample's original row labels and renumbers its rows from 0, so it is not resetting product_category and sentiment.
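To make the "cycles through the groups" part concrete, here is a rough sketch that spells the same idea out as an explicit loop. The toy DataFrame, the review_id column and the random_state value are made up for illustration and are not from the lab.

```python
import pandas as pd

# Toy data (hypothetical values, not the lab's dataset).
df = pd.DataFrame({
    "product_category": ["Dresses"] * 6 + ["Tops"] * 4,
    "sentiment": [1, 1, 1, 1, -1, -1, 1, 1, -1, -1],
    "review_id": range(10),
})

grouped = df.groupby(["product_category", "sentiment"])
n_min = grouped.size().min()  # row count of the smallest group

# Iterating over a groupby object yields (key, sub-DataFrame) pairs.
# Sampling n_min rows from each one mirrors what the lambda does per group.
samples = []
for key, group in grouped:
    samples.append(group.sample(n_min, random_state=0))

# Stitch the samples back together and renumber the rows from 0.
df_balanced_loop = pd.concat(samples).reset_index(drop=True)
```

Every (product_category, sentiment) pair contributes exactly n_min rows, which is why the result is balanced.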

Does that help? Basically instead of dropping data, we are randomly selecting rows from each group, the number of which equals the size of the smallest group.
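If it helps, here is a small check you could run yourself: the lab's two lines applied to a toy DataFrame, with the group sizes compared before and after. The data and the review_body values are invented for illustration. The sizes after balancing are counted from the index levels, on the assumption that apply() puts the group keys into the result's index.

```python
import pandas as pd

# Toy data (hypothetical values, not the lab's dataset).
df = pd.DataFrame({
    "product_category": ["Blouses"] * 8 + ["Pants"] * 6,
    "sentiment": [1, 1, 1, 1, 0, 0, -1, -1, 1, 1, 0, 0, -1, -1],
    "review_body": [f"review {i}" for i in range(14)],
})

# The two lines from the lab.
df_grouped_by = df.groupby(["product_category", "sentiment"])
df_balanced = df_grouped_by.apply(
    lambda x: x.sample(df_grouped_by.size().min()).reset_index(drop=True)
)

# Group sizes before balancing: uneven.
sizes_before = df_grouped_by.size()

# Group sizes after balancing, counted from the group keys in the index.
sizes_after = df_balanced.groupby(level=["product_category", "sentiment"]).size()

print(sizes_before)
print(sizes_after)
```

Printing df_balanced.index should also show the answer to the index question: the first two levels are the group keys, and the last level is the fresh 0, 1, ... numbering produced by reset_index(drop=True) inside each group.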
