Balanced Train/Dev/Test


I was confused by the image below: why are we dividing the whole-number splits by the percentages? For example, he divides 21 by 60% to get 35% under the Random Train Split. Or am I just misunderstanding this?

Assume that the 100 data points you have get split into 60 for training, 20 for dev set and 20 for test set.

Let’s now assume that you get 21 positive examples out of 60 examples in the training dataset, 2 positive examples out of 20 in dev set and 7 positive examples out of 20 in test set.

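The training set has 35% positive examples, i.e. 21 out of 60. The division is by the number of examples in the split (60), not by the 60% split ratio.
Similarly, the dev set has 10% positive examples (2 out of 20) and the test set has 35% positive examples (7 out of 20).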
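
If it helps, here is a minimal sketch of that arithmetic in Python. The dictionary name `splits` and its layout are just my own choices for illustration; the counts are the ones assumed above. Each percentage is the number of positive examples divided by the size of that split.

```python
# (positive examples, total examples) for each split, taken from the example above
splits = {"train": (21, 60), "dev": (2, 20), "test": (7, 20)}

# Percentage of positives = positives / size of the split (not the split ratio)
percent_positive = {
    name: 100 * positives / total
    for name, (positives, total) in splits.items()
}

for name, pct in percent_positive.items():
    print(f"{name}: {pct:.0f}% positive")
```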

To reiterate, there is only one split. The numbers 21, 2, and 7 are just the assumed counts of positive examples in the train / dev / test sets; they come out uneven like this because the split is done randomly.

He has highlighted the percentages of positive examples to show how unevenly the positive examples are distributed across the train / dev / test sets.
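
For contrast, here is a rough sketch of what a balanced (stratified) split could look like. This is my own illustration and not code from the lecture: `stratified_split` is a hypothetical helper, and the total of 30 positives is simply 21 + 2 + 7 from the example above. The idea is to split each class separately with the same 60/20/20 ratios, so every set ends up with the same share of positives.

```python
import random

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Hypothetical helper: split indices so each class follows `fractions`."""
    rng = random.Random(seed)

    # Group example indices by their label
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    parts = [[] for _ in fractions]
    for indices in by_class.values():
        rng.shuffle(indices)
        start = 0
        for i, frac in enumerate(fractions):
            # The last part takes whatever remains, so no example is dropped
            if i == len(fractions) - 1:
                end = len(indices)
            else:
                end = start + round(frac * len(indices))
            parts[i].extend(indices[start:end])
            start = end
    return parts

# 100 examples with 30 positives (21 + 2 + 7 in the example above)
labels = [1] * 30 + [0] * 70
train, dev, test = stratified_split(labels)

# Count the positives that landed in each split
positives = [sum(labels[i] for i in part) for part in (train, dev, test)]
print(positives)
```

With these assumed numbers, each set gets 30% positives (18 of 60, 6 of 20, 6 of 20) instead of the 35% / 10% / 35% from the random split.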