Trouble Understanding Class Imbalance (CI)

Hello, can someone explain how SageMaker Clarify computes Class Imbalance (CI)? In the balanced dataset in the assignment (data_config_balanced), we sample the same number of reviews for each class, so I don’t understand why the CI for each category is 0.89. (Since we sample 9 reviews for each class, I expected something more like 9/27 = 0.3.)

Hello @huanvo,

I think the calculation goes like this (we’re talking about the balanced dataset):

  • For the label sentiment using label value(s)/threshold = 1

  • For the value Blouses

  1. Count the reviews with positive sentiment in the whole dataset: 162
  2. Count the reviews with positive sentiment for Blouses: 9
  3. Apply the formula in the SageMaker Developer Guide

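Let’s say na is the number of positive reviews in all the other categories and nb is the number for Blouses:
na = 162 - 9 = 153
nb = 9
CI = (na - nb) / (na + nb) = 144 / 162 ≈ 0.889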

kind regards,

Thanks, that makes sense. However, the SageMaker documentation says that a CI value close to 1 indicates highly imbalanced data, but in this case we intentionally constructed a balanced dataset…

Best,
Huan

Hello @huanvo,

You’re right, but the Class Imbalance formula is being applied to product_category Blouses versus all the other product categories combined, and in that sense the data is imbalanced. If you instead compare, for instance, product_category Blouses with product_category Lounge, the Class Imbalance is zero (perfectly balanced):

CI = (9 - 9) / (9 + 9) = 0
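
For anyone who wants to check the two comparisons side by side, here is a minimal Python sketch. It only uses the counts quoted in this thread (162 positive reviews in total, 9 per category). The helper name class_imbalance is made up for illustration and is not part of the SageMaker Clarify API.

```python
def class_imbalance(n_a, n_d):
    """CI = (n_a - n_d) / (n_a + n_d), the formula quoted earlier in this thread."""
    return (n_a - n_d) / (n_a + n_d)


# Counts taken from the thread (balanced dataset, positive sentiment).
total_positive = 162
blouses = 9
lounge = 9

# Comparison 1: Blouses against every other category lumped together.
ci_vs_rest = class_imbalance(total_positive - blouses, blouses)

# Comparison 2: Blouses against a single other category (Lounge).
ci_vs_lounge = class_imbalance(blouses, lounge)

print(f"Blouses vs. all other categories: {ci_vs_rest:.3f}")
print(f"Blouses vs. Lounge only:          {ci_vs_lounge:.3f}")
```

The same 9 reviews give a large CI in the first comparison and zero in the second, so the value depends on which two groups are being compared.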

It’s just another way of looking at imbalance. The main idea here is:

  • "Class Imbalance (CI). Measures the imbalance in the number of members between different facet values. Answers the question, does a product_category have disproportionately more reviews than others? Values of CI will become equal for even distribution between facets. Here, different CI values show the existence of imbalance."

And in the report, all the product categories have the same CI value.

hope it helps,

Ah, I see. Thanks for the explanation. So I guess the key point is to identify the two facet values that are being compared. :)

I had the same question, and the explanation above helps me understand how CI is calculated in this case. However, I am still not sure why this metric is useful when there are multiple classes. Here we actually have a perfectly balanced dataset, yet CI tells us it isn’t?

Hi @huanvo, @pdarling,

To add a more conceptual view: class imbalance is a typical machine learning challenge when developing and deploying algorithms. With a data-centric approach to training, we want our algorithms to generalize well so that they add value in the real world.
It is therefore essential to assess the data and to address such issues during training.

Take, for instance, a typical cancer prediction test where ~99% of the labels are negative. If you don’t address this class imbalance during training, your algorithm in production may reach 99% accuracy simply by always predicting a negative outcome. The same applies to multiple classes. In image segmentation for computer vision, for instance, you may find that most of the images in your dataset belong to one class, such as “background / typical environment” (in satellite pictures) or “empty road” (when recording a trip for self-driving cars).
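
To make the 99% example concrete, here is a small Python sketch. The data is entirely made up (1,000 hypothetical patients, 10 of them positive), and the "model" is just a rule that always predicts negative, so this only illustrates the effect and is not a real classifier.

```python
# Hypothetical labels: 1 = positive, 0 = negative (about 99% negative).
labels = [1] * 10 + [0] * 990

# A "model" that ignores its input and always predicts negative.
predictions = [0] * len(labels)

# Accuracy: fraction of predictions that match the label.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on the positive class: fraction of real positives that were found.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy = {accuracy:.2f}")
print(f"recall   = {recall:.2f}")
```

The accuracy looks excellent while the recall on the positive class is zero, which is why the imbalance has to be checked before trusting a single headline metric.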

I hope that also helps!

Cheers!

In this case, what we need to check is whether the CI value is the same for all product categories. If it is, the dataset is perfectly balanced.
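
That check can be sketched in Python as follows. This is an illustration under assumptions, not Clarify's implementation: the category names are placeholders, the 18 categories are implied by the numbers in this thread (162 / 9), and the imbalanced counts are invented. With k equally sized categories, each category-versus-rest CI works out to (k - 2) / k, which for k = 18 is 16/18 ≈ 0.889.

```python
def one_vs_rest_ci(counts):
    """Return CI for each facet value compared against all the others combined.

    counts maps a facet value to its number of samples.
    CI = (n_rest - n_facet) / (n_rest + n_facet).
    """
    total = sum(counts.values())
    return {facet: ((total - n) - n) / total for facet, n in counts.items()}


# Balanced case: 18 placeholder categories with 9 reviews each (assumed from 162 / 9).
balanced = {f"category_{i}": 9 for i in range(18)}
ci_balanced = one_vs_rest_ci(balanced)

# Imbalanced case (made-up numbers): one category has 90 reviews instead of 9.
imbalanced = dict(balanced, category_0=90)
ci_imbalanced = one_vs_rest_ci(imbalanced)

# If every facet has the same CI, the facets are evenly sized.
print("balanced, distinct CI values:  ", sorted(set(ci_balanced.values())))
print("imbalanced, distinct CI values:", sorted(set(ci_imbalanced.values())))
```

In the balanced case there is a single distinct CI value. In the imbalanced case the oversized category gets a lower CI than the others, so the values no longer agree.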

regards,


Thanks for the replies. That does make sense. The point I am really making is that the CI metric doesn’t feel very intuitive to me, because the significance of a particular result depends on a comparison with other results.

Hi,
In the C1W2 assignment, we see the CI metric calculated for balanced and imbalanced datasets. The AWS documentation page "Measure Pretraining Bias" only covers the binary classification case in its bias metric definitions. How are the bias metrics, including CI and DPL, defined for multi-class cases such as the dataset in this assignment? We know that if CI is equal for all facets, the facets have the same number of training samples, but we don’t know what the value 0.888889 itself means. If, for instance, we look at another balanced dataset, such as movie reviews, we will get a different CI value. Am I right to say that as long as the CIs are equal to each other, the facets are balanced? In other words, what does the CI value mean for different balanced datasets?
Same question for DPL.