Balancing the dataset: minimum size

While balancing the data, the minimum size taken is 9, which actually comes from sentiment==0. This drastically reduces the dataset from 22626 to 486. However, the outcome of interest is sentiment==1, and the minimum size under sentiment==1 is 78. In the bias config, the desired outcome is specified as sentiment==1 for the facet ‘product category’, but the balancing step ignores this. Shouldn't the right approach be to take the minimum for sentiment==1? Please clarify.
Also, assuming this approach, please share the code to use for balancing (taking the minimum size for sentiment==1).


In relation to the above, could you also clarify what the argument label_values_or_threshold means in bias_config = clarify.BiasConfig(

Hi Aroonima,

According to the SageMaker documentation, here is the definition of label_values_or_threshold:
List of label value(s) or threshold to indicate positive outcome used for bias metrics. The appropriate threshold depends on the problem type:

  • Binary: The list has one positive value.
  • Categorical: The list has one or more (but not all) categories which are the positive values.
  • Regression: The list should include one threshold that defines the exclusive lower bound of positive values.
  Source: Processing — sagemaker 2.101.1 documentation
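To make the three cases above concrete, here is a minimal sketch of how a label could be classified as a positive outcome. The helper is_positive and its problem_type argument are hypothetical names made up for illustration; this is not SageMaker Clarify's implementation, only a reading of the quoted definition.

```python
# Hypothetical helper for illustration only; NOT part of the SageMaker SDK.
def is_positive(label, label_values_or_threshold, problem_type):
    """Return True if `label` counts as a positive outcome."""
    if problem_type in ("binary", "categorical"):
        # The list holds the label value(s) treated as positive.
        return label in label_values_or_threshold
    if problem_type == "regression":
        # The list holds one threshold: an exclusive lower bound.
        threshold = label_values_or_threshold[0]
        return label > threshold
    raise ValueError(f"unknown problem type: {problem_type}")


# With sentiment labels -1, 0, 1 and [1] as the positive value:
print(is_positive(1, [1], "categorical"))   # True
print(is_positive(0, [1], "categorical"))   # False
# Exclusive bound: a value equal to the threshold is not positive.
print(is_positive(3.5, [3.5], "regression"))  # False
```

Under this reading, passing [1] for label_values_or_threshold marks sentiment==1 as the positive outcome for the bias metrics.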

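If we chose 78 as the sample size, some categories would not have enough reviews to sample from. The same sample size is drawn from every group, and the smallest group (under sentiment==0) has only 9 reviews.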

In the following line,
df_balanced = df_grouped_by.apply(lambda x: x.sample(sample_size).reset_index(drop=True))

we can change sample_size to the value we want for balancing. However, because sample() draws without replacement by default, the program raises an error when the value is greater than the size of the smallest group, which is 9 in this case.
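The behaviour described above can be reproduced on a small toy dataset. This is only a sketch: the data is made up, the column names product_category and sentiment are assumptions about the notebook's DataFrame, and it uses pandas' GroupBy.sample shortcut rather than the apply/lambda line quoted above.

```python
import pandas as pd

# Toy data (made up); column names are assumed, not taken from the notebook.
rows = (
    [("Dresses", 1)] * 4 + [("Dresses", 0)] * 2 + [("Dresses", -1)] * 3
    + [("Tops", 1)] * 5 + [("Tops", 0)] * 3 + [("Tops", -1)] * 2
)
df = pd.DataFrame(rows, columns=["product_category", "sentiment"])
keys = ["product_category", "sentiment"]

# Size of every (category, sentiment) group, and the overall minimum.
group_sizes = df.groupby(keys).size()
sample_size = int(group_sizes.min())

# Minimum taken over the sentiment==1 groups only.
min_positive = int(group_sizes.xs(1, level="sentiment").min())

# Draw the same number of rows from every group (without replacement).
df_balanced = (
    df.groupby(keys).sample(n=sample_size, random_state=0)
      .reset_index(drop=True)
)

# Asking for more rows than the smallest group holds raises ValueError.
try:
    df.groupby(keys).sample(n=min_positive, random_state=0)
    oversample_failed = False
except ValueError:
    oversample_failed = True

print(sample_size, min_positive, len(df_balanced), oversample_failed)
```

In this toy data the overall minimum (2) is smaller than the sentiment==1 minimum (4), which mirrors the 9 versus 78 situation in the question: the larger value cannot be used because some groups hold fewer rows than that.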

Thanks, @XinghaoZong

So, since the dataset is biased towards sentiment==1 across product categories, we take the minimum of 9 so that all the labels are balanced. Besides the solution provided, I believe other methods can be tried as well, outside the scope of the graded notebook.
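As one possible example of such an alternative, the sketch below caps each group at the sentiment==1 minimum and keeps smaller groups whole. It is not the graded notebook's method, it leaves the smaller groups unbalanced, and the toy data and column names are assumptions.

```python
import pandas as pd

# Toy data (made up); column names are assumed, not taken from the notebook.
rows = (
    [("Dresses", 1)] * 4 + [("Dresses", 0)] * 2 + [("Dresses", -1)] * 3
    + [("Tops", 1)] * 5 + [("Tops", 0)] * 3 + [("Tops", -1)] * 2
)
df = pd.DataFrame(rows, columns=["product_category", "sentiment"])
keys = ["product_category", "sentiment"]

# Cap = smallest group size among the sentiment==1 groups.
cap = int(df[df["sentiment"] == 1].groupby("product_category").size().min())

# Shuffle all rows, then keep at most `cap` rows per group.
# head() returns fewer rows for groups smaller than the cap, so no error.
df_capped = (
    df.sample(frac=1, random_state=0)
      .groupby(keys)
      .head(cap)
      .reset_index(drop=True)
)

print(df_capped.groupby(keys).size())
```

This balances sentiment==1 across product categories while discarding less data, at the cost of leaving the other sentiment classes under-represented.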