I noticed that Andrew’s advice is to split all 3 sets (Train, Validation and Test) in a balanced way. Whilst I agree for the train and validation sets, I am not sure about Test.
Sure, ideally we would like to keep a balanced distribution across all three, but shouldn’t the Test set reflect the most recent data in time, which may or may not follow the Train/Val distribution? By randomizing the Test set as well, we might instead end up with old observations that are not entirely representative of the latest business.
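To make the contrast concrete, here is a rough sketch (not from the course) of the two strategies, assuming a pandas DataFrame `df` with `timestamp` and `label` columns; the column names and the 80/10/10 proportions are just illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def random_split(df: pd.DataFrame, seed: int = 42):
    """Shuffled split: train/val/test all drawn from the same distribution."""
    train, rest = train_test_split(df, test_size=0.2, random_state=seed,
                                   stratify=df["label"])
    val, test = train_test_split(rest, test_size=0.5, random_state=seed,
                                 stratify=rest["label"])
    return train, val, test

def chronological_split(df: pd.DataFrame):
    """Time-based split: the test set is the most recent slice of the data."""
    df = df.sort_values("timestamp")
    n = len(df)
    train = df.iloc[: int(0.8 * n)]
    val = df.iloc[int(0.8 * n): int(0.9 * n)]
    test = df.iloc[int(0.9 * n):]  # newest observations only
    return train, val, test
```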
In my understanding, the test set's role is to measure how well our model generalizes to unseen user input. In that sense, it does not need to be the most recent data in time; it is enough that it does not overlap with the training data.
If the test set and training set are collected arbitrarily and share nothing in common, then there is nothing we can do to improve the performance (generalization error) of the model, so we usually assume they come from the same data-generating process. In practice, we split the test set and training set from one source of data.
The same goes for the validation set: its purpose is to tune the model's hyperparameters, and the requirement is that it does not overlap with the training and test sets.
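As a small illustration of these roles (the model, the metric and the hyperparameter grid are just placeholders, and the splits are assumed to already exist as `(X_train, y_train)`, `(X_val, y_val)`, `(X_test, y_test)`):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Validation set picks the hyperparameter.
best_c, best_val_acc = None, -1.0
for c in [0.01, 0.1, 1.0, 10.0]:  # candidate hyperparameter values
    model = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val))
    if val_acc > best_val_acc:
        best_c, best_val_acc = c, val_acc

# Test set is touched once, only to estimate generalization.
final_model = LogisticRegression(C=best_c, max_iter=1000).fit(X_train, y_train)
test_acc = accuracy_score(y_test, final_model.predict(X_test))
```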
Thank you very much for your answer, very interesting.
I believe that by taking the Test set to be the most recent data available, it serves not only as a check on how well the model generalizes but also, implicitly, as an attempt to detect whether any data drift is occurring.
Perhaps not the right step in the MLOps lifecycle to check for skews (yet).
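For what I mean by implicitly detecting drift: something as simple as a per-feature two-sample KS test between the training data and the most recent test slice could already flag it. The 0.05 threshold and the assumption of shared numeric feature columns below are illustrative choices on my part:

```python
from scipy.stats import ks_2samp

def drifted_features(train, recent_test, features, alpha=0.05):
    """Two-sample KS test per feature: a small p-value suggests the distributions differ."""
    flagged = []
    for col in features:
        stat, p_value = ks_2samp(train[col].dropna(), recent_test[col].dropna())
        if p_value < alpha:
            flagged.append((col, stat, p_value))
    return flagged
```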
I see what you mean here. I guess the Test data you mentioned would be the data we receive from users after production deployment.
In my understanding, the train/val/test sets are the data we use to train our model before production deployment.
As for the skews, I think they happen after production deployment, when we receive new, unseen data from users, which may have a different distribution from our training data.
@mikaelmv I think you make a good point about data drift in an online situation. If you know that such a shift is a possibility you are concerned about, and a new set of the latest data is available as a test set, I would use that as one of the factors in determining the need for retraining. @tranvinhcuong is right that when we talk about train/val/test data, we are mostly talking about the development stage. But some models may be trained on the fly too, so in that case it would be best to use the latest data.
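Just as a hedged sketch of what "one of the factors" could look like in practice (the accuracy metric and both thresholds are assumptions on my part, not a prescribed recipe):

```python
from sklearn.metrics import accuracy_score

def should_retrain(model, latest_X, latest_y, baseline_acc,
                   n_drifted_features, acc_drop_tol=0.05, drift_tol=3):
    """Flag retraining if accuracy on the newest labelled batch drops or many features drift."""
    latest_acc = accuracy_score(latest_y, model.predict(latest_X))
    degraded = latest_acc < baseline_acc - acc_drop_tol
    drifted = n_drifted_features >= drift_tol
    return degraded or drifted
```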