How should I split the dataset into train/dev/test if the dataset changes frequently?

Hi, I have a question about splitting the dataset into train/dev/test when the original dataset changes every day. This is a very common situation when building recommender systems. Here is an example.

I have a dataset of user-product purchases, and I am trying to predict the next purchase of a user. Each row in the dataset is a pair (x, y), where x is a sequence of the user's past product purchases (IDs) plus other information about the user, and y is the next product ID. I can train a specific neural architecture for this task.
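
To make that concrete, here is a rough sketch of what one row could look like (all product IDs and field names below are made up for illustration):

```python
# One hypothetical (x, y) row; every ID and feature name here is invented.
x = {
    "purchase_history": [1042, 87, 993],   # past product IDs, oldest first
    "user_features": {"country": "US", "account_age_days": 412},
}
y = 556  # the next product ID the user actually bought (prediction target)
```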

The issue is that, depending on when you collect the data, the distribution can change as users keep placing new purchases.

On day 1 I have different train/dev/test data than on day 2, and so on (unlike in vision tasks, where images do not change over time).

What is the best advice in this scenario?

Hello @parin
If I understand the problem correctly: on a given day you have a dataset, and the next day you collect more data, so an updated dataset is available. I believe the best approach is to use your latest available dataset at training time, including all previous actions by users, then randomly split it into train/dev/test and train the model. What matters is that the distributions of train, dev, and test at any given training instance are not too different.
The next day you will of course have the updated dataset. I think you should re-train including the new data, again splitting the train/dev/test subsets randomly.
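
A minimal sketch of such a random split, assuming the dataset is held in memory as a list of (x, y) pairs (the split fractions are just an example):

```python
import random

def train_dev_test_split(dataset, dev_frac=0.1, test_frac=0.1, seed=42):
    """Randomly split a list of (x, y) examples into train/dev/test."""
    shuffled = dataset[:]               # copy so the original order is untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_dev = int(len(shuffled) * dev_frac)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test
```

Calling this again on each day's latest snapshot means all three splits are drawn from the same, current distribution.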

I hope this helps?

You will need to decide how often you want to re-train with new data. This could be every hour, every day, every week, every month…
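
For example, a daily re-training job could look roughly like this; `load_latest_dataset`, `train_model`, and `evaluate` are placeholders for your own pipeline, and `train_dev_test_split` is the sketch from the previous reply:

```python
import time

def retrain_forever(load_latest_dataset, train_model, evaluate,
                    interval_seconds=24 * 60 * 60):
    """Re-train on a fixed schedule; the three callables are your own pipeline."""
    while True:
        dataset = load_latest_dataset()                   # fetch the newest snapshot
        train, dev, test = train_dev_test_split(dataset)  # re-split from scratch
        model = train_model(train, dev)
        evaluate(model, test)
        time.sleep(interval_seconds)  # once a day by default; adjust to your needs
```

In production you would more likely trigger this from cron or a workflow scheduler rather than a sleep loop, but the shape is the same: re-collect, re-split, re-train.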

Thanks @carloshvp for your answer! It makes perfect sense!
