Pretraining Bias Metrics in the case of multiple features

The Pretraining Bias Metrics are calculated for a single feature. In the C1W2 assignment, we balanced the product_category feature. In a scenario where there are multiple features, do we balance feature by feature? If yes, how do we ensure that balancing one feature does not affect an already balanced feature?

Thank you!

Hi @DNVamsiRamana! Thanks for your question.

You have touched on a crucial point when dealing with class imbalance (CI), especially with unstructured data such as text for sentiment analysis. You can apply data augmentation to the underrepresented classes to improve model performance. With unstructured data, such measures tend to work well independently within each class, even when you leave the better-represented classes untouched.

If you want to be very diligent with your iterations, take that road after careful error analysis within the different classes against your baseline(s). That way you can better identify opportunities. Check out this article about baselines and this one about error analysis, both of which #machine-learning-engineering-for-production refers to.

It is worth mentioning that GANs can also be used for data augmentation.

Thank you for the reply and the resources. That was really insightful, but I think I did not ask my question clearly. Let me elaborate. Suppose there are two features, A and B (A is not balanced but B is balanced). CI is calculated for a facet of a feature. If CI for facet d of feature A is < 0, it implies that there are more samples of facet d than of the other facets in feature A. Now, suppose we balance feature A by removing a few samples from the facets that have CI < 0, so that all facets have the same number of samples. This can affect the balance of the facets of feature B, no?

Thank you!

I’m glad the resources I shared were helpful. Regarding your follow-up:

Yes, it may affect the balance of feature B. That is why we should avoid removing examples to address CI. You may refer to the techniques I shared above to tackle it. Note that iterating and conducting error analysis can guide you toward which treatment to apply next.
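To make the effect concrete, here is a minimal sketch on a made-up toy dataset. It assumes the usual definition CI = (n_a - n_d) / (n_a + n_d), where n_d is the number of rows in the facet of interest and n_a is the number of remaining rows. The feature values (`a1`, `a2`, `b1`, `b2`), the row counts, and the helper names are all hypothetical. Plain duplication stands in for data augmentation here, and oversampling each (A, B) combination is only one possible way to keep both features balanced, not something the course prescribes.

```python
from collections import Counter


def class_imbalance(values, facet_d):
    """CI = (n_a - n_d) / (n_a + n_d) for one facet of one feature."""
    n_d = sum(1 for v in values if v == facet_d)
    n_a = len(values) - n_d
    return (n_a - n_d) / (n_a + n_d)


def ci_report(data):
    """Return (CI of facet a1 in feature A, CI of facet b1 in feature B)."""
    a_values = [a for a, _ in data]
    b_values = [b for _, b in data]
    return class_imbalance(a_values, "a1"), class_imbalance(b_values, "b1")


# Hypothetical toy data: each row is (feature A, feature B).
# A is unbalanced (8 x a1, 4 x a2); B is balanced (6 x b1, 6 x b2).
rows = (
    [("a1", "b1")] * 5
    + [("a1", "b2")] * 3
    + [("a2", "b1")] * 1
    + [("a2", "b2")] * 3
)
ci_a_before, ci_b_before = ci_report(rows)

# Naive fix: drop rows of the over-represented facet a1 until A is balanced.
counts_a = Counter(a for a, _ in rows)
to_drop = counts_a["a1"] - counts_a["a2"]
downsampled = []
for row in rows:
    if row[0] == "a1" and to_drop > 0:
        to_drop -= 1  # this row is removed
        continue
    downsampled.append(row)
ci_a_down, ci_b_down = ci_report(downsampled)

# Alternative: oversample every (A, B) combination up to the largest one,
# using duplication as a stand-in for real data augmentation.
cells = Counter(rows)
target = max(cells.values())
oversampled = [cell for cell in cells for _ in range(target)]
ci_a_over, ci_b_over = ci_report(oversampled)

print(f"before:      CI(A)={ci_a_before:.2f}  CI(B)={ci_b_before:.2f}")
print(f"downsampled: CI(A)={ci_a_down:.2f}  CI(B)={ci_b_down:.2f}")
print(f"oversampled: CI(A)={ci_a_over:.2f}  CI(B)={ci_b_over:.2f}")
```

In this toy data, removing rows brings CI for A to 0 but pushes CI for B from 0 to 0.5, because the dropped rows all happened to have B = b1. Oversampling per (A, B) combination brings both to 0, although it only works when every combination has at least one example to start from.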

I hope it helps and happy learning!