I have a question regarding the train/dev/test split. In principle, the distribution of the data in all sets must be similar (we can’t expect the model to work well on a different distribution of data points). My idea is that as long as we can guarantee, with enough confidence, that the distributions of those three sets are the same, then we don’t have to discuss how much of the data must be dedicated to train/dev/test. Maybe there exists a sound statistical framework under which the dev/test data is sampled reasonably from the train set, considering the requirements of:
- Dev/test sets reflecting the distribution of the train set sufficiently well (related to sample distributions)
- We do not lose confidence in the distribution of the train set
There are no hard and fast rules for splitting a dataset into train, dev, and test sets, but there are some guidelines that are commonly followed.
In general, the train set should be the largest of the three sets, and it is used to train the model.
If the dataset is “small”, say less than 100K records:
Dev and Test can each be between 10% and 20% of the total.
If the dataset is “large”, say over a million records:
Dev and Test can be defined in the tens of thousands of records rather than as a percentage of the total. For instance, on a 1-million-record set, we could set aside 20K for dev and 20K for test.
It is important to randomly shuffle the data before splitting it into the three sets, to ensure that the data is representative of the overall distribution and to prevent any biases in the split.
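To make that concrete, here is a minimal sketch (in NumPy, with made-up array names and sizes) of shuffling once and then carving fixed-size dev and test sets off a large dataset:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# X, y stand in for your features and labels; here a fake 1M-record dataset.
n = 1_000_000
X = rng.normal(size=(n, 10))
y = rng.integers(0, 2, size=n)

# Shuffle once, then carve off fixed-size dev and test sets (20K each),
# leaving the rest for training.
perm = rng.permutation(n)
X, y = X[perm], y[perm]

dev_size, test_size = 20_000, 20_000
X_test, y_test = X[:test_size], y[:test_size]
X_dev, y_dev = X[test_size:test_size + dev_size], y[test_size:test_size + dev_size]
X_train, y_train = X[test_size + dev_size:], y[test_size + dev_size:]

print(len(X_train), len(X_dev), len(X_test))  # 960000 20000 20000
```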
Thank you for your response. Yes, it makes sense why it is the standard practice. I was thinking more from the statistical point of view. But when we have more data, I think it is reasonable to have only 20K data points, as that is already a large enough number to capture the distribution well. So we do not have to go through the extra step of sampling in a ‘statistically sound’ way.
Exactly! At that point 20K data points or so should be enough. And of course, it is all relative. Maybe on 10 million data points you would want to use 50K or 100K. Each case should have some criteria applied to determine the proper numbers.
One further clarification here: we always need to make sure that whatever sampling method we use to subset the data into training, dev and test sets is “statistically sound” and not biased. You need to understand how your data is organized. E.g. the MNIST dataset is sorted by the digit labels, so if you just take the last 10% for the dev and test sets, then the training set has no 9 entries and the dev and test sets have only 9s. That’s not going to work out very well. Juan mentioned in his replies that at the very least you need to randomly shuffle the dataset before doing any subsetting.
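A toy sketch of that failure mode, using an artificial label-sorted array rather than actual MNIST:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a label-sorted dataset: 10 classes, 1000 examples each, sorted by label.
y = np.repeat(np.arange(10), 1000)

# Naive tail split: take the last 10% as dev -- you get only class 9.
dev_tail = y[-1000:]
print(np.unique(dev_tail))     # [9]

# Shuffle first, then take the last 10% -- all classes appear.
dev_shuffled = rng.permutation(y)[-1000:]
print(np.unique(dev_shuffled)) # [0 1 2 3 4 5 6 7 8 9]
```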
To extend a bit on the case of multiple categories: make sure that each set contains a representative sample of examples from each class. This can be achieved through stratified sampling, which involves partitioning the dataset based on the class labels so that each set contains roughly the same proportion of examples from each class.
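For example, a small sketch of a stratified train/dev/test split using scikit-learn’s train_test_split (the data here is synthetic, and the 80/10/10 proportions are just an assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = rng.choice([0, 1, 2], size=10_000, p=[0.7, 0.2, 0.1])  # imbalanced classes

# First split off train vs. (dev + test), stratifying on the labels,
# then split the held-out 20% in half into dev and test, again stratified.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=42)
```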
In any case, it is a good idea to check the distribution of examples within each set after the split is done, to make sure that each set is representative of the overall dataset. If the distribution is significantly different between the sets, it could impact the model’s performance on the test set.
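Continuing the stratified-split sketch above (reusing its y_train, y_dev, y_test arrays), a quick check of the class proportions in each set could look like this:

```python
import numpy as np

def class_proportions(labels):
    # Fraction of examples belonging to each class, rounded for readability.
    classes, counts = np.unique(labels, return_counts=True)
    return dict(zip(classes.tolist(), np.round(counts / counts.sum(), 3).tolist()))

for name, labels in [("train", y_train), ("dev", y_dev), ("test", y_test)]:
    print(name, class_proportions(labels))
```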