Dev and test distribution

Among the guidelines for structuring ML projects, one is to pick a dev set and test set that come from the same distribution.
But if my test set is fixed, what is a good way to sample a dev set from the data so that its distribution is as close as possible to the test set's?

I think the simplest way to make sure that the two sets have similar distributions is to sample based on the target values.
This is most easily demonstrated with a classification problem.
Say your training data has 100 data points with 50% class A, 25% class B, 15% class C, and 10% class D, and you want to set aside 10% of this training set as the dev set.
In this scenario, you want to sample 10% from each class, so in the end your dev set will have roughly 5 class A, 2.5 class B, 1.5 class C, and 1 class D examples.
This can be extended to regression models too, for example by binning the target values and stratifying on the bins.
But good mixing and random sampling often do the job. You can always histogram the result to make sure the distributions are indeed similar.
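
If it helps, here is a minimal sketch of that kind of stratified split using scikit-learn's train_test_split with the stratify argument; the X and y arrays are made-up placeholders matching the class proportions above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                # 100 examples, 4 dummy features
y = rng.choice(["A", "B", "C", "D"], size=100,
               p=[0.50, 0.25, 0.15, 0.10])   # class proportions from the example

# stratify=y keeps the class proportions of y in both splits
X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0)

# Sanity check: per-class counts of the 10-example dev set
print(np.unique(y_dev, return_counts=True))
```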
Does that answer your question?


Hi @houzefa303

I would like to add a couple of hints to the great answer from @suki.

Ideally, the descriptive statistics of the features in your test set are similar to those of your dev set.

Here are a couple of examples:

  • One of your features is “sex” (categorical) and the dataset is 50-50 Male-Female. The dev and test sets should also have a similar proportion.
  • For any continuous variable you can check the mean, standard deviation, etc., for example using the method describe() in Pandas (see the sketch after this list).
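
Here is a quick sketch of that comparison; dev_df and test_df are hypothetical DataFrames standing in for your two splits:

```python
import pandas as pd

dev_df = pd.DataFrame({"age": [23, 35, 41, 29], "sex": ["M", "F", "F", "M"]})
test_df = pd.DataFrame({"age": [25, 33, 44, 30], "sex": ["F", "M", "M", "F"]})

# Summary statistics (mean, std, quartiles, ...) for the continuous columns
print(dev_df.describe())
print(test_df.describe())

# Category proportions for the categorical column
print(dev_df["sex"].value_counts(normalize=True))
print(test_df["sex"].value_counts(normalize=True))
```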

In general, any difference in the distributions of the train/dev/test datasets may have a negative impact on model fitting. Distributions can be compared with descriptive statistics or visually with histograms or box plots.
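
For the visual check, something like this works; dev_vals and test_vals are placeholder arrays standing in for one feature from each split:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
dev_vals = rng.normal(0, 1, 500)     # placeholder feature values
test_vals = rng.normal(0, 1, 500)

# Overlaid, density-normalized histograms make mismatches easy to spot
plt.hist(dev_vals, bins=30, alpha=0.5, density=True, label="dev")
plt.hist(test_vals, bins=30, alpha=0.5, density=True, label="test")
plt.legend()
plt.show()
```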


Yes thank you.

I understand the idea for a multi-class setting, but what if we are in a multilabel setting and the classes are not mutually exclusive?
How can we still sample the classes to get the same distribution in a systematic way?

@houzefa303
Please help me understand your question correctly.
Are you talking about the situation where you need multiple outputs per example?

For example: X → (y0, y1, …, yn), i.e., each X gets mapped to the labels y0, …, yn?

For that, you would need some other strategy. I believe there are many ways to do this, but I would consider the following:
One way to check the similarity of multivariate/multilabel data is to look at distances between points. For example, you can compute the distance between the average samples of the test and dev sets and make sure they are close to one another.
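
A minimal sketch of that check, assuming Y_dev and Y_test are 0/1 label matrices of shape (n_samples, n_labels); the arrays here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
Y_dev = rng.integers(0, 2, size=(50, 8))     # placeholder multilabel targets
Y_test = rng.integers(0, 2, size=(200, 8))

# The mean label vector is just the per-label frequency in each split
mean_dev = Y_dev.mean(axis=0)
mean_test = Y_test.mean(axis=0)

# Euclidean distance between the mean vectors; small means similar splits
print(np.linalg.norm(mean_dev - mean_test))
```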
Does this help?

Yes, I am referring to the situation where one example in the training set can have multiple labels associated with it; for example, in music classification an example can be 1 for rock and 1 for electric-guitar simultaneously, and 0 for all other labels.
Since the labels are not independent of each other, it's hard to sample the classes to respect some distribution (say I want 5 examples for rock and only 1 example for electric-guitar; that would be hard, since a 1 for rock implies, most of the time, a 1 for electric-guitar as well).
So you would suggest sampling based on the average distance?

@houzefa303
In splitting the training set and dev set, you don't necessarily want to specify the distribution of either one of the groups, though you do want to make sure the two are very similar in terms of distribution. In other words, you want both sets to cover as much of the support as possible.
If you have a good random sampling method, like shuffling, permutation, etc., that should work OK. But to make sure they are indeed similar, I can suggest using something like the Frobenius norm (an average-distance-style measure) for multivariate target data like the one you are describing.
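
One way to turn that into code, under the assumption that we compare the label correlation matrices of the two splits; Y_dev and Y_test are random placeholders of shape (n_samples, n_labels):

```python
import numpy as np

rng = np.random.default_rng(0)
Y_dev = rng.integers(0, 2, size=(50, 8))
Y_test = rng.integers(0, 2, size=(200, 8))

# Compare per-label co-occurrence structure rather than raw samples,
# since the two splits have different sizes: use label correlation matrices.
C_dev = np.corrcoef(Y_dev, rowvar=False)
C_test = np.corrcoef(Y_test, rowvar=False)

# Frobenius norm of the difference; near zero means similar label structure
print(np.linalg.norm(C_dev - C_test, ord="fro"))
```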
Does it help? Let me know.


Yes, that sounds right to me; I will try it out.
Thank you for your help :blush: