Hello, everybody,
I just had a general question regarding the development set. How exactly do you pick it? I understand it must come from the same population distribution as your training set (unlike the potentially optional test set), but how exactly do you “draw” the development set out of the population that you get your training set from? Are both the development set and the training set just random samples, with the development set simply being a lot smaller? I’d appreciate any insight anyone could offer me on this.
Many thanks!
The simplest strategy is that you pool all your labelled input data into one set. Then you randomly shuffle it and select the subsets for training, dev and test. That gives you the highest chance that all three sets are statistically representative (“from the same distribution”). BTW I think you are misinterpreting what Prof Ng said if you got the impression that the test set is optional.
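To make the shuffle-and-split idea concrete, here is a minimal sketch in plain Python. The function name `shuffle_split`, the default 60/20/20 fractions and the fixed seed are just illustrative choices on my part, not anything from the course materials.

```python
import random


def shuffle_split(examples, train_frac=0.6, dev_frac=0.2, seed=0):
    """Shuffle a copy of the pooled data, then cut it into train, dev and test.

    Whatever is left after the train and dev slices becomes the test set.
    """
    pooled = list(examples)              # copy so the caller's data is untouched
    random.Random(seed).shuffle(pooled)  # seeded so the split is reproducible

    n_train = round(len(pooled) * train_frac)
    n_dev = round(len(pooled) * dev_frac)

    train = pooled[:n_train]
    dev = pooled[n_train:n_train + n_dev]
    test = pooled[n_train + n_dev:]
    return train, dev, test


# Hypothetical pool of 1000 labelled examples, represented here by their indices.
train, dev, test = shuffle_split(range(1000))
print(len(train), len(dev), len(test))
```

Because all three subsets are slices of the same shuffled pool, none of them is drawn differently from the others; they only differ in size.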
In terms of how to size the various subsets, Prof Ng discusses that in the video and gives you rules of thumb. It depends on the total amount of labelled data you have. If you have a relatively small aggregate dataset (< 10^5 examples), then you typically use something like 60/20/20 or maybe 80/10/10 for training, dev and test. If you have a relatively large dataset (> 10^6 examples), then the dev and test sets can be smaller percentages. Please watch the lecture again for more details on the set sizes.
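As a quick worked example of what those percentages mean in absolute numbers, here is a small sketch. The helper `split_sizes` is hypothetical, and the 98/1/1 split for the large dataset is only an illustrative assumption of what "smaller percentages" could look like, so check the lecture for the actual recommendation.

```python
def split_sizes(n_total, train_pct, dev_pct, test_pct):
    """Turn whole-number percentages into example counts for train, dev and test."""
    if train_pct + dev_pct + test_pct != 100:
        raise ValueError("percentages must add up to 100")
    n_dev = n_total * dev_pct // 100
    n_test = n_total * test_pct // 100
    n_train = n_total - n_dev - n_test   # training set absorbs any rounding remainder
    return n_train, n_dev, n_test


# Small dataset: a 60/20/20 split of 10,000 examples.
small = split_sizes(10_000, 60, 20, 20)

# Large dataset: an assumed 98/1/1 split of 2,000,000 examples.
large = split_sizes(2_000_000, 98, 1, 1)

print(small)
print(large)
```

The point is that 1% of a couple of million examples is still tens of thousands of examples, which is usually plenty for comparing models on the dev set.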
Thank you for your reply! It was quite helpful. I will rewatch the lecture with this information in mind.