Smaller splits to compare and infer properties of model variants

Assume that train, validation, and test splits are performed and respected; that is, the model is tested on unseen data, hyperparameter adjustments are made against the validation set, and training only occurs on the training set.

Say I have a dataset of 500_000 samples, and that given the limitations of the systems I can train on, it is only practical to use 100_000 of those samples. Even training on the training portion of those 100_000 takes too long to be practical.

Is it bad practice to make smaller splits, for example of 25_000 samples, and train and validate with that amount to speed up development, compare different model architectures, and iterate? Again, the train-val-test splits are respected and never mixed.
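To make the setup concrete, here is a minimal sketch of such a pilot split, using only the standard library; `subsample_split` and the 80/10/10 fractions are hypothetical choices, not anything prescribed above:

```python
import random

def subsample_split(indices, n_sub, seed=0, frac=(0.8, 0.1, 0.1)):
    """Draw n_sub indices at random, then cut them into train/val/test."""
    rng = random.Random(seed)
    sub = rng.sample(indices, n_sub)          # random subset of the full dataset
    n_train = int(frac[0] * n_sub)
    n_val = int(frac[1] * n_sub)
    train = sub[:n_train]
    val = sub[n_train:n_train + n_val]
    test = sub[n_train + n_val:]
    return train, val, test

# full dataset of 500_000 samples; pilot experiments on 25_000 of them
all_idx = list(range(500_000))
train, val, test = subsample_split(all_idx, 25_000)
```

Because the three pieces are cut from one disjoint sample, the train-val-test separation is respected inside the subset just as it would be on the full dataset.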

And would the conclusions I draw (for example, that Model A is 15% more accurate than Model B, or 5% faster) hold, or have any value at all, once I increase the amount of data I train with? In other words, is this a feasible way of working?
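One way to sanity-check whether a ranking holds is to plot it as a crude learning curve: train both candidates at a few increasing subset sizes and see if the ordering is stable. The sketch below is a toy, assuming a synthetic noisy-threshold task; `stump` and `majority` are hypothetical stand-ins for Model A and Model B, not real architectures:

```python
import random

def make_data(n, seed):
    """Synthetic task: label is 1 when x > 0.5, flipped 10% of the time."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [int((x > 0.5) != (rng.random() < 0.1)) for x in xs]
    return xs, ys

def stump(train_x, train_y):
    """'Model A': pick the threshold with the best training accuracy."""
    best_t, best_acc = 0.0, -1.0
    for t in [i / 20 for i in range(21)]:
        acc = sum((x > t) == y for x, y in zip(train_x, train_y)) / len(train_x)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return lambda x: int(x > best_t)

def majority(train_x, train_y):
    """'Model B': always predict the majority class of the training set."""
    m = int(sum(train_y) * 2 >= len(train_y))
    return lambda x: m

def accuracy(model, xs, ys):
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)

# one held-out test set; training subsets of increasing size
test_x, test_y = make_data(5_000, seed=999)
results = {}
for n in (250, 1_000, 4_000):
    tx, ty = make_data(n, seed=n)
    a = accuracy(stump(tx, ty), test_x, test_y)
    b = accuracy(majority(tx, ty), test_x, test_y)
    results[n] = (a, b)
    print(n, round(a, 3), round(b, 3))
```

If the gap between the two models shrinks or flips as `n` grows, the small-split comparison was not predictive; if the ordering is stable across sizes, that is some evidence it will survive scaling up. The exact magnitudes (the "15% more accurate") are much less likely to transfer than the ordering itself.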

It is not a bad practice. You're essentially running a pilot experiment on a subsample, much as a mini-batch approximates the full batch.

Thank you for your answer.