I have seen some people confused about the difference between the Train, Dev, and Test sets. Correctly separating the dataset into Train, Dev (sometimes also called validation), and Test sets is critical to avoid problems later when training an ML model. In short:
Train set: the subset of the whole dataset used for training your model. It is usually at least ~80% of the whole dataset (with big datasets, closer to 95% or even 99%).
Dev set: the subset of the whole dataset used for computing metrics on each epoch during training. It is usually at most ~10% of the whole dataset, and much less if the dataset is big enough.
Test set: the subset of the whole dataset used for a final check that the model generalizes well to data it has never seen before. It is usually about as big as the Dev set. The Test set is never used during training, except at the very end for final verification. The Test set should be as close as possible to the production data, but also to the Dev set, so that the results are easier to interpret. If the metrics show very different values on the Dev and Test sets, it usually means there is a difference in distributions, or overfitting.
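To make the split concrete, here is a minimal sketch using scikit-learn's train_test_split (the 80/10/10 ratios and the random dummy data are just illustrative assumptions, not the course's exact recipe):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data just for illustration: 1000 samples, 5 features.
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# First carve out the Test set (10% of the total), then split the
# remainder into Train (~80% of the total) and Dev (~10% of the total).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42)
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.10 / 0.90, random_state=42)

print(len(X_train), len(X_dev), len(X_test))  # 800 100 100
```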
For more details, please see the explanation from the course:
Is this confusing for you? → Please write a comment below
I would like to add why a validation set is even necessary.
Training multiple times with the same validation set might lead to overfitting on the validation set, that is, an information leak.
In my personal experience, I use the Dev set before and during training for “internal” testing. Sometimes you want to adjust hyper-parameters prior to or during training. For that purpose, it’s a good idea to use a separate set that is set aside from the training set, to get as unbiased an estimate as possible. If you tune the hyper-parameters on the training set, the estimate will obviously be biased towards the training data. For the same reason, you shouldn’t use the test set either.
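As a sketch of that workflow (reusing the splits from the example above; LogisticRegression and the list of regularization strengths are just placeholder assumptions): fit on Train, pick hyper-parameters on Dev, and touch Test exactly once at the end.

```python
from sklearn.linear_model import LogisticRegression

# Try a few regularization strengths; evaluate each on the Dev set only.
best_C, best_dev_score = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)            # fit on Train only
    dev_score = model.score(X_dev, y_dev)  # tune on Dev only
    if dev_score > best_dev_score:
        best_C, best_dev_score = C, dev_score

# The Test set is used exactly once, for the final unbiased estimate.
final_model = LogisticRegression(C=best_C, max_iter=1000)
final_model.fit(X_train, y_train)
print("test accuracy:", final_model.score(X_test, y_test))
```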
If your data comes from the same population and you take random samples from that population, then the train, dev, and test errors should be similar (within some acceptable variation) for a single model?
From the videos it seems the emphasis is on the fact that your dev set may 1) not be from the same distribution as the training data, or 2) not be a random sample.
I am really confused about this.
It seems the dev and test sets should first be checked to see whether they come from the same distribution.
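One common sanity check for this (sometimes called adversarial validation; the sketch below is my own assumption, not something from the course) is to train a classifier to distinguish dev rows from test rows. If it can't do better than chance, the two sets plausibly come from the same distribution.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Label dev rows 0 and test rows 1, then try to tell them apart.
# X_dev and X_test are assumed to be numeric feature arrays, as above.
X_both = np.vstack([X_dev, X_test])
y_both = np.concatenate([np.zeros(len(X_dev)), np.ones(len(X_test))])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
auc = cross_val_score(clf, X_both, y_both, cv=5, scoring="roc_auc").mean()
print("AUC:", auc)  # ~0.5 suggests same distribution; near 1.0 suggests a shift
```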
Hi William,
Keep in mind, the main purpose of keeping these separate sets (train, dev, test) is to make sure your model generalizes well.
You can always train your model to perform well on the training data, and the training data only, but such a model isn’t useful if it doesn’t work on any other dataset. That’s why you want to keep separate sets: to test whether the model works on data it wasn’t trained on. In other words, you want to keep it from overfitting to the training data.
That said, the train, dev, and test sets should come from the same distribution. They should be randomly sampled, but from the same distribution, so they still represent the same data.
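For classification, one simple way to keep all three splits representative of the same distribution is stratified sampling on the labels (a minimal sketch, reusing the X and y from the first example; the 80/10/10 ratios are again just an assumption):

```python
from sklearn.model_selection import train_test_split

# stratify= keeps the class proportions identical in every split.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42)
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.10 / 0.90, stratify=y_rest, random_state=42)
```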
Does that help?
What I know from my experience: I split the data into train and test splits. The test split is kept aside for testing the model at the very end.
The train split is used for fitting the model and hyper-parameter tuning. This is known as cross-validation. The procedure shuffles the train data and divides it into, say, 3 parts: two thirds are used for fitting and one third for validating. This is then repeated two more times, rotating which third is held out, so you get validation results averaged over the different validation subsets.
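That rotation is roughly what scikit-learn's cross_val_score does (by default the folds aren't re-shuffled each round, but the rotate-and-average idea is the same; the model choice here is a placeholder, reusing the train split from above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 3-fold CV: each third of the train split is held out once for
# validation, and the three validation scores are averaged.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=3)
print("mean CV accuracy:", scores.mean())
```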