I’m not sure what the code below is doing:
df_grouped_by = df.groupby(['product_category', 'sentiment'])
df_balanced = df_grouped_by.apply(lambda x: x.sample(df_grouped_by.size().min()).reset_index(drop=True))
Also, I’m not sure what balanced data means. It looks like it simply calculates 100/the_number_of_categories. What’s the point of balancing data?
If you still have doubts about how the code is converting unbalanced data to balanced data, let me know.
“Also, I’m not sure what balanced data means. It looks like it simply calculates 100/the_number_of_categories. What’s the point of balancing data?”
Let’s say you are given an unbalanced dataset of 1000 samples, 900 positive and 100 negative, to be classified into two categories, and you use a k-NN classifier. Because the negative examples are so few (sparse), most of a negative sample’s nearest neighbors may be positive samples, so the model ends up misclassifying negative samples as positive. ML models are data-driven: to get a classifier close to the optimal one, you need to remove the bias caused by the imbalance in the dataset.
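To make the 900/100 example concrete, here is a minimal sketch of the same downsampling idea as the snippet in the question. The toy DataFrame, the "books" category, and the `review` column are made up for illustration, and it uses `pd.concat` over the groups instead of `apply`, which is just one way to write it:

```python
import pandas as pd

# Hypothetical toy data: one category with 900 positive and 100 negative reviews.
df = pd.DataFrame({
    "product_category": ["books"] * 1000,
    "sentiment": [1] * 900 + [-1] * 100,
    "review": [f"review {i}" for i in range(1000)],
})

grouped = df.groupby(["product_category", "sentiment"])

# Size of the smallest (product_category, sentiment) group: 100 here.
min_size = int(grouped.size().min())

# Randomly keep min_size rows from every group, so all groups end up equal in size.
df_balanced = pd.concat(
    [group.sample(n=min_size, random_state=0) for _, group in grouped],
    ignore_index=True,
)

counts = df_balanced["sentiment"].value_counts()
print(counts)
```

After this, each group holds the same number of rows, which is why the percentages look like 100/the_number_of_groups (50% each with two groups). The cost is that 800 of the positive rows are thrown away.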