Week 2 - Balance the Dataset

I am not sure what the code below is doing:
df_grouped_by = df.groupby(['product_category', 'sentiment'])
df_balanced = df_grouped_by.apply(lambda x: x.sample(df_grouped_by.size().min()).reset_index(drop=True))

Also, I’m not sure what balanced data means. It looks like it simply calculates 100 / the_number_of_categories. What’s the point of balancing data?

Hi @stakehara,

To better understand the code, please go through these tutorials.
group by
reset index
sample

If you still have doubts about how the code is converting unbalanced data to balanced data, let me know.
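As a rough sketch of what those three calls do together, here is the same idea unrolled step by step on a small made-up DataFrame. The column names come from the snippet in the question; the toy rows, the `review_body` column, and the `random_state` value are assumptions for illustration only. It also uses `pd.concat` over the groups instead of `apply`, which should give the same kind of result while being easier to read.

```python
import pandas as pd

# Hypothetical toy data; only the two grouping column names come from the question.
df = pd.DataFrame({
    "product_category": ["Dresses"] * 6 + ["Tops"] * 5,
    "sentiment": [1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0],
    "review_body": [f"review {i}" for i in range(11)],
})

# Step 1: form one group per (product_category, sentiment) pair.
df_grouped_by = df.groupby(["product_category", "sentiment"])

# Step 2: count the rows in each group and find the smallest group.
group_sizes = df_grouped_by.size()
min_size = int(group_sizes.min())

# Step 3: draw min_size random rows from every group and stack the results.
# ignore_index=True plays the role of reset_index(drop=True).
df_balanced = pd.concat(
    [group.sample(min_size, random_state=0) for _, group in df_grouped_by],
    ignore_index=True,
)

balanced_sizes = df_balanced.groupby(["product_category", "sentiment"]).size()
print(group_sizes)     # sizes before balancing differ between groups
print(balanced_sizes)  # after balancing every group has min_size rows
```

Every group is cut down to the size of the smallest group, so rows from the larger groups are discarded at random.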

“Also, I’m not sure what balanced data means. It looks like it simply calculates 100 / the_number_of_categories. What’s the point of balancing data?”

Let’s say you are given an unbalanced dataset of 1000 samples, 900 positive and 100 negative, that needs to be classified into two categories, and you are using a kNN classifier. Because negative examples are so sparse, most of the nearest neighbors of a negative sample may be positive samples, so the model ends up misclassifying negative samples as positive. ML models are data-driven: to get a classifier close to the optimal one, you need to remove the bias caused by the imbalance in the dataset.
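A quick way to see why this matters, using the 900/100 split from the example above. This is not a kNN model, just a hypothetical worst case: a "classifier" that always predicts positive. It still scores high accuracy while missing every negative sample.

```python
# Counts from the example: 900 positive and 100 negative samples.
n_positive, n_negative = 900, 100
labels = [1] * n_positive + [0] * n_negative

# Hypothetical degenerate classifier: ignores the input, always predicts positive.
predictions = [1] * len(labels)

# Fraction of all samples predicted correctly.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Fraction of the negative samples that were correctly predicted as negative.
negative_recall = (
    sum(p == 0 and y == 0 for p, y in zip(predictions, labels)) / n_negative
)

print(f"accuracy: {accuracy:.2f}")                    # accuracy: 0.90
print(f"recall on negatives: {negative_recall:.2f}")  # recall on negatives: 0.00
```

With balanced classes the same degenerate model would only reach 50% accuracy, so the metric no longer hides the problem.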

Best Regards,
A. Sriharsha


Hi @stakehara, to add to what @sriharsha0806 explained so well, please also refer to this thread of messages about Class Imbalance.

I hope you find it helpful.

Cheers!

It looks like the calculation randomly samples every category down to the size of the smallest one.

Yeah, thank you for the explanation. Now I understand what the exercise is trying to do.

I appreciate it!
