I noticed that Andrew’s advice is to split all 3 sets (Train, Validation and Test) in a balanced way. Whilst I agree for the train and validation sets, I am not sure about Test.
Sure, ideally we would like to keep a balanced distribution across all three, but shouldn’t the Test set reflect the most recent data in time, which may or may not follow the Train/Val distribution? By randomizing the Test set as well, we might instead end up with old observations that are not entirely representative of the latest business.
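To make the contrast concrete, here is a rough sketch (not from the course) of the two strategies, assuming a pandas DataFrame `df` with `timestamp` and `label` columns; the column names and the 80/10/10 proportions are just illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def random_split(df: pd.DataFrame, seed: int = 42):
    """Shuffled split: train/val/test all drawn from the same distribution."""
    train, rest = train_test_split(df, test_size=0.2, random_state=seed,
                                   stratify=df["label"])
    val, test = train_test_split(rest, test_size=0.5, random_state=seed,
                                 stratify=rest["label"])
    return train, val, test

def chronological_split(df: pd.DataFrame):
    """Time-based split: the test set is the most recent slice of the data."""
    df = df.sort_values("timestamp")
    n = len(df)
    train = df.iloc[: int(0.8 * n)]
    val = df.iloc[int(0.8 * n): int(0.9 * n)]
    test = df.iloc[int(0.9 * n):]  # newest observations only
    return train, val, test
```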
In my understanding, the test set's role is to measure how well our model generalizes to unseen user input. In that sense, it does not need to be the most recent data in time; it is enough that it does not overlap with the training data.
If the test set and training set are collected arbitrarily and share nothing in common, then there is nothing we can do to improve the performance (generalization error) of the model, so we usually assume they come from the same data-generating process. In practice, we split the test set and training set from one source of data.
The same goes for the validation set: its purpose is to tune the model's hyperparameters, and the requirement is that it does not overlap with the training and test sets.
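As a small illustration of these roles (the model, the metric and the hyperparameter grid are just placeholders, and the splits are assumed to already exist as `(X_train, y_train)`, `(X_val, y_val)`, `(X_test, y_test)`):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Validation set picks the hyperparameter.
best_c, best_val_acc = None, -1.0
for c in [0.01, 0.1, 1.0, 10.0]:  # candidate hyperparameter values
    model = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val))
    if val_acc > best_val_acc:
        best_c, best_val_acc = c, val_acc

# Test set is touched once, only to estimate generalization.
final_model = LogisticRegression(C=best_c, max_iter=1000).fit(X_train, y_train)
test_acc = accuracy_score(y_test, final_model.predict(X_test))
```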
Thank you very much for your answer, very interesting.
I believe that by taking the Test set to be the most recent data available, it serves not only as a check on how well the model generalizes but also, implicitly, as an attempt to detect whether any data drift is occurring.
Perhaps not the right step in the MLOps lifecycle to check for skews (yet).
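For what I mean by implicitly detecting drift: something as simple as a per-feature two-sample KS test between the training data and the most recent test slice could already flag it. The 0.05 threshold and the assumption of shared numeric feature columns below are illustrative choices on my part:

```python
from scipy.stats import ks_2samp

def drifted_features(train, recent_test, features, alpha=0.05):
    """Two-sample KS test per feature: a small p-value suggests the distributions differ."""
    flagged = []
    for col in features:
        stat, p_value = ks_2samp(train[col].dropna(), recent_test[col].dropna())
        if p_value < alpha:
            flagged.append((col, stat, p_value))
    return flagged
```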
I see what you mean here. I guess the Test data you mentioned would be the data we receive from users after production deployment.
In my understanding, the train/val/test sets are the data we use to train our model before production deployment.
As for the skews, I think they happen after production deployment, when we receive new, unseen data from users, which may have a different distribution from our training data.
@mikaelmv I think you make a good point about data drift in an online situation. If you know that such a shift is a possibility you are concerned about, and a new set of the latest data is available as a test set, I would use that as one of the factors in determining the need for retraining. @tranvinhcuong is right that when we talk about train/val/test data, we are mostly talking about the development stage. But some models may be trained on the fly too, so in that case it would be best to use the latest data.
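Just as a hedged sketch of what "one of the factors" could look like in practice (the accuracy metric and both thresholds are assumptions on my part, not a prescribed recipe):

```python
from sklearn.metrics import accuracy_score

def should_retrain(model, latest_X, latest_y, baseline_acc,
                   n_drifted_features, acc_drop_tol=0.05, drift_tol=3):
    """Flag retraining if accuracy on the newest labelled batch drops or many features drift."""
    latest_acc = accuracy_score(latest_y, model.predict(latest_X))
    degraded = latest_acc < baseline_acc - acc_drop_tol
    drifted = n_drifted_features >= drift_tol
    return degraded or drifted
```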