Balanced Train/Dev/Test

adesai · January 28, 2022, 7:13pm

Hi,

I'm confused by the image below: why are we dividing the whole-number splits by the percentages? For example, he divides 21 by 60% to get 35% under the Random Train Split. Or am I just misunderstanding this?

balaji.ambresh · January 30, 2022, 10:11am

Assume that the 100 data points you have get split into 60 for training, 20 for dev set and 20 for test set.

Let’s now assume that you get 21 positive examples out of 60 examples in the training dataset, 2 positive examples out of 20 in dev set and 7 positive examples out of 20 in test set.

The training set has 35% positive examples, i.e. 21 out of 60.
Similarly, the dev set has 10% positive examples (2 out of 20) and the test set has 35% positive examples (7 out of 20).
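
In case it helps, here is a minimal Python sketch of that arithmetic. Each percentage is the number of positive examples in a split divided by the size of that split, not a count divided by a percentage. The counts come from the example above; the names `splits` and `positive_rate` are just illustrative.

```python
# (positives, total) for each split, taken from the example in this thread.
splits = {"train": (21, 60), "dev": (2, 20), "test": (7, 20)}


def positive_rate(positives, total):
    """Return the share of positive examples in a split, as a percentage."""
    return 100 * positives / total


rates = {name: positive_rate(p, n) for name, (p, n) in splits.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.0f}% positive")
```

The dev set ends up with a much lower share of positives than the train and test sets, which is the imbalance being pointed out.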

To reiterate, there’s only 1 split. The numbers 21, 2 and 7 are just assumed counts of positive examples in the train / dev / test sets, which can happen because the split is done randomly.

He has highlighted the percentage of positive examples to show the imbalance in positive examples distribution across train / dev / test sets.
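
For contrast, here is a rough sketch of what a balanced split could look like for the same numbers (100 examples, 21 + 2 + 7 = 30 of them positive). It is only an illustration under my own assumptions, not the method from the lecture: `balanced_split` is a hypothetical helper that shuffles each class separately and then cuts it 60/20/20, so every split keeps roughly the same share of positives.

```python
import random


def balanced_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split example indices so each split keeps about the same class mix.

    Hypothetical helper for illustration: every class is shuffled and cut
    separately, and the last split takes whatever is left over.
    """
    rng = random.Random(seed)
    parts = [[] for _ in fractions]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        start = 0
        for k, frac in enumerate(fractions):
            is_last = k == len(fractions) - 1
            end = len(idx) if is_last else start + round(frac * len(idx))
            parts[k].extend(idx[start:end])
            start = end
    return parts


# 100 examples with 30 positives, matching the totals in the example above.
labels = [1] * 30 + [0] * 70
train, dev, test = balanced_split(labels)

for name, part in zip(("train", "dev", "test"), (train, dev, test)):
    positives = sum(labels[i] for i in part)
    print(f"{name}: {positives} of {len(part)} positive")
```

With these inputs each split should come out at 30% positive (18 of 60, 6 of 20, 6 of 20), instead of the 35% / 10% / 35% from the random split.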
