Skewed datasets


After watching the optional lesson on skewed datasets, I was wondering if you could share more insights on the influence of the distribution of the dataset (including training, validation and test examples) on the quality of the learned model.
For instance, if one deliberately increases the number of examples of a rare class, doesn’t that bias the model?
Shouldn’t the dataset be representative of what the model will encounter when used? How can this be diagnosed and mitigated?




Hey @ljb1706,

I will try to break your question down into points so that it is clear.
First, let’s understand what an imbalanced dataset is.

  1. Imbalanced Datasets:

    • When one class is significantly underrepresented compared to others, it is called an imbalanced dataset. For example, in a binary classification problem, one class may have far fewer examples than the other.

    • This can lead to biased models: the model may perform poorly on the minority class because it doesn't have enough data to learn from.

Now that we understand what an imbalanced dataset is, let’s see how it affects model bias.

  2. Impact on Model Bias:

    • Imbalanced datasets can bias the model towards the majority class, leading to high accuracy on the majority class but poor performance on the minority class.

    • The model may become overly conservative in predicting the minority class, leading to many false negatives and few true positives (that is, low recall on the minority class).
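To make the point above concrete, here is a minimal sketch of how this bias can be diagnosed by looking at precision and recall rather than accuracy alone. The helper `binary_metrics` and the toy labels (990 negatives, 10 positives) are made up for illustration and assume the minority class is labeled 1.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for the positive (minority) class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Confusion-matrix counts, treating label 1 as the rare class
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Hypothetical skewed labels: 990 negatives and 10 positives
y_true = np.array([0] * 990 + [1] * 10)
# A "model" that always predicts the majority class
y_pred_all_negative = np.zeros_like(y_true)

metrics = binary_metrics(y_true, y_pred_all_negative)
print(metrics)
```

On this toy data the always-negative predictor reaches 0.99 accuracy while its recall on the rare class is 0, which is why accuracy by itself is a poor diagnostic for skewed datasets.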

So far this sounds interesting, right? You may now be wondering how diagnosis and mitigation can be applied. Let’s address that below:

  3. Diagnosis and Mitigation:
    To address the issues related to imbalanced datasets, you can consider the following strategies:

    • Resampling Techniques:

      • Oversampling: Increase the number of instances in the minority class by duplicating or generating synthetic data points.

      • Undersampling: Decrease the number of instances in the majority class by randomly removing data points.

    • Weighted Loss Functions:

      • Assign different weights to the classes in the loss function. Give higher weight to the minority class so that its misclassifications are penalized more heavily.
    • Collect More Data:

      • If possible, collect more data for the minority class to balance the dataset naturally.
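The resampling and weighted-loss ideas above can be sketched roughly as follows, using only NumPy. The function names (`random_oversample`, `balanced_class_weights`, `weighted_bce`) and the toy data are hypothetical, not from the course. The weighting rule `n_samples / (n_classes * class_count)` is one common heuristic (it matches scikit-learn's `class_weight="balanced"` formula), not the only option. One assumption worth stating: resampling is usually applied to the training set only, so that the validation and test sets keep the distribution the model will see in use.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate rows at random until every class has as many rows as the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx_parts = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        # Draw the missing rows with replacement from the existing rows of this class
        extra = rng.choice(idx, size=target - n, replace=True)
        idx_parts.append(np.concatenate([idx, extra]))
    idx_all = np.concatenate(idx_parts)
    rng.shuffle(idx_all)
    return X[idx_all], y[idx_all]

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

def weighted_bce(y_true, p_pred, weights, eps=1e-12):
    """Binary cross-entropy where each example is scaled by the weight of its true class."""
    y_true = np.asarray(y_true)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    per_example = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    w = np.where(y_true == 1, weights[1], weights[0])
    return float(np.mean(w * per_example))

# Hypothetical training set: 90 majority (0) and 10 minority (1) examples
X_train = np.arange(100, dtype=float).reshape(-1, 1)
y_train = np.array([0] * 90 + [1] * 10)

X_bal, y_bal = random_oversample(X_train, y_train)
class_weights = balanced_class_weights(y_train)
print(np.bincount(y_bal), class_weights)
```

In practice you would typically pick one of the two approaches rather than stacking both, and then check precision and recall on an untouched validation set to see whether the change actually helped.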

There are more methods for handling and mitigating these kinds of problems, but I have mentioned only some of them. I hope it’s clear now.


Thanks for the detailed answer. Collecting more data seems the most appropriate option.


It is, indeed! However, in real-world scenarios, obtaining additional data may not always be straightforward. Therefore, it’s essential to explore alternative techniques.