Train_dev_test split doubt

Doubt over the split for train/dev/test set:

Let’s say I have 2 datasets: 1 training dataset (100000 examples) and 1 test dataset (10000 examples).
How should we split the training set to get the dev set? Is 20000 records good enough for the dev set?

  • In one of the videos, it was mentioned that the dev set and test set should come from the same distribution.
    So how do we achieve that in this case, considering the dev set comes from one dataset and the test set comes from a different one?

20000 examples for the dev set should be OK, I would say.

To give the sets the same distribution, merge the two datasets, shuffle the combined data thoroughly so the examples are well mixed, and then divide it into train/dev/test.
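
A rough sketch of that merge, shuffle, and split in plain Python, in case it helps. The function name `merge_shuffle_split`, the placeholder pools, the fixed seed, and keeping the test set at 10000 examples are all my own assumptions for illustration; only the dataset sizes and the 20000 dev size come from this thread.

```python
import random

def merge_shuffle_split(train_pool, test_pool, n_dev, n_test, seed=0):
    """Merge two example pools, shuffle them together, and cut train/dev/test."""
    merged = list(train_pool) + list(test_pool)
    if n_dev + n_test >= len(merged):
        raise ValueError("dev + test sizes must leave examples for training")
    rng = random.Random(seed)  # fixed seed (an assumption) so the split is reproducible
    rng.shuffle(merged)
    # After shuffling, every slice is a random sample of the merged data,
    # so dev and test are drawn from the same distribution.
    dev = merged[:n_dev]
    test = merged[n_dev:n_dev + n_test]
    train = merged[n_dev + n_test:]
    return train, dev, test

# Placeholder examples tagged with their source ("a" = original training set,
# "b" = original test set); real data would be (features, label) records.
pool_a = [("a", i) for i in range(100_000)]
pool_b = [("b", i) for i in range(10_000)]

train, dev, test = merge_shuffle_split(pool_a, pool_b, n_dev=20_000, n_test=10_000)
print(len(train), len(dev), len(test))  # 80000 20000 10000
```

With these sizes, 110000 merged examples leave 80000 for training once the 20000 dev and 10000 test examples are cut off.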

Thanks for the clarification @gent.spah
