Skewed datasets

ljb1706 · November 5, 2023, 8:00am

Hi,

After watching the optional lesson on skewed datasets, I was wondering if you could share more insights on the influence of the distribution of the dataset (including training, validation and test examples) on the quality of the learned model.
For instance, if one deliberately increases the number of examples of a rare class, doesn’t that bias the model?
Shouldn’t the dataset be representative of what the model will encounter when used? How can this be diagnosed and mitigated?

Thanks,

Laurent.

Jamal022 · November 5, 2023, 5:49pm

Hey @ljb1706,

I will try to break your post into points so that it is clear for you.
First, let’s understand what an imbalanced dataset means.

Imbalanced Datasets:
- When one class is significantly underrepresented compared to others, it is called an imbalanced dataset. For example, in a binary classification problem, one class may have far fewer examples than the other.
- This can lead to biased models: the model may perform poorly on the minority class because it doesn't have enough data to learn from.

Okay, now that we understand what an imbalanced dataset is, let’s see how it impacts model bias.

Impact on Model Bias:
- Imbalanced datasets can bias the model towards the majority class, leading to high accuracy on the majority class but poor performance on the minority class.
- The model may become overly conservative in predicting the minority class, leading to many false negatives and few true positives (that is, low recall on the minority class).
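
To make the diagnosis side concrete, here is a minimal sketch in plain Python. The data is made up for illustration (990 negatives, 10 positives), and the helper names `confusion_counts`, `accuracy` and `recall` are hypothetical, not from the course. It shows how a model that always predicts the majority class can look excellent on accuracy while never finding the minority class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the minority class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of all examples predicted correctly."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def recall(y_true, y_pred):
    """Fraction of actual positives that the model found."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


# Made-up skewed dataset: 990 negatives (0) and 10 positives (1).
y_true = [0] * 990 + [1] * 10

# A degenerate "model" that always predicts the majority class.
y_pred = [0] * 1000

acc = accuracy(y_true, y_pred)  # 990 correct out of 1000
rec = recall(y_true, y_pred)    # 0 of the 10 positives found

print(f"accuracy = {acc:.2f}, minority-class recall = {rec:.2f}")
```

So a first diagnostic step is to look at per-class metrics such as recall (and precision) rather than accuracy alone.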

So far it sounds interesting, right? But you may be wondering how diagnosis and mitigation can be applied, so let’s address this point below:

Diagnosis and Mitigation:
To address the issues related to imbalanced datasets, you can consider the following strategies:
- Resampling Techniques:
  - Oversampling: Increase the number of instances in the minority class by duplicating or generating synthetic data points.
  - Undersampling: Decrease the number of instances in the majority class by randomly removing data points.
- Weighted Loss Functions:
  - Assign different weights to the classes in the loss function. Give a higher weight to the minority class so that misclassifying it is penalized more.
- Collect More Data:
  - If possible, collect more data for the minority class to balance the dataset naturally.
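
Two of the strategies above, oversampling by duplication and a class-weighted loss, can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the toy data (90 negatives, 10 positives) is made up, the function names are hypothetical, and the weight formula `n_samples / (n_classes * class_count)` is one common heuristic rather than the only choice.

```python
import math
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen examples until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        # Draw duplicates (with replacement) to fill the gap to the target size.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


def inverse_frequency_weights(y):
    """Heuristic class weights: rarer classes get proportionally larger weights."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}


def weighted_log_loss(y_true, p_pred, weights, eps=1e-12):
    """Binary cross-entropy where each example is scaled by its class weight."""
    total, weight_sum = 0.0, 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        w = weights[t]
        total += -w * (t * math.log(p) + (1 - t) * math.log(1 - p))
        weight_sum += w
    return total / weight_sum


# Made-up training data: 90 negatives and 10 positives, one feature each.
X_train = [[float(i)] for i in range(100)]
y_train = [0] * 90 + [1] * 10

X_bal, y_bal = random_oversample(X_train, y_train)
weights = inverse_frequency_weights(y_train)

print(Counter(y_bal))  # both classes now have 90 examples
print(weights)         # the minority class gets the larger weight
```

On Laurent's point about representativeness: resampling or reweighting is usually applied to the training set only, while the validation and test sets keep the distribution the model will actually encounter, so the metrics stay meaningful.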

There are more methods to handle and mitigate these kinds of problems; I have only mentioned some of them, and I hope it’s clear for you now.

Regards,
Jamal

ljb1706 · November 5, 2023, 7:55pm

Thanks for the detailed answer. Collecting more data seems the most appropriate option.

Jamal022 · November 5, 2023, 8:00pm

It is, indeed! However, in real-world scenarios, obtaining additional data may not always be straightforward. Therefore, it’s essential to explore alternative techniques.

Regards,
Jamal
