How should I split the dataset into train/dev/test if the dataset changes frequently?

Hi, I have a question about splitting the dataset into train/dev/test when the original dataset changes every day. This is a very common situation when building recommender systems. Here is an example.

I have a dataset of user-product purchases, and I am trying to predict the next purchase of a user. Each row in the dataset is a pair (x, y), where x is a sequence of the user's past product purchases (IDs) plus other information about the user, and y is the next product ID. I can train a specific neural architecture for this task.
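
To make that concrete, here is a rough sketch of what one row could look like (all product IDs and field names below are made up for illustration):

```python
# One hypothetical (x, y) row; every ID and feature name here is invented.
x = {
    "purchase_history": [1042, 87, 993],   # past product IDs, oldest first
    "user_features": {"country": "US", "account_age_days": 412},
}
y = 556  # the next product ID the user actually bought (prediction target)
```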

The issue is that, depending on when you collect the data, the distribution can change as users keep placing new purchases.

On day 1 I have different train/dev/test data than on day 2, and so on (unlike in vision tasks, where images do not change over time).

What is the best advice in this scenario?

Hello @parin
If I understand the problem correctly: on a given day you have a dataset, and the next day you collect more data, so an updated dataset is available. I believe the best approach is to use your latest available dataset at training time, including all previous actions by users, then randomly split it into train/dev/test and train the model. What matters is that the distributions of train, dev, and test at any given training instance are not too different.
The next day you will of course have the updated dataset. I think you should re-train including the new data, again splitting the train/dev/test subsets randomly.
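
A minimal sketch of such a random split, assuming the dataset is held in memory as a list of (x, y) pairs (the split fractions are just an example):

```python
import random

def train_dev_test_split(dataset, dev_frac=0.1, test_frac=0.1, seed=42):
    """Randomly split a list of (x, y) examples into train/dev/test."""
    shuffled = dataset[:]               # copy so the original order is untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_dev = int(len(shuffled) * dev_frac)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test
```

Calling this again on each day's latest snapshot means all three splits are drawn from the same, current distribution.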

I hope this helps?

You will need to decide how often you want to re-train with new data. This could be every hour, every day, every week, every month…
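
For example, a daily re-training job could look roughly like this; `load_latest_dataset`, `train_model`, and `evaluate` are placeholders for your own pipeline, and `train_dev_test_split` is the sketch from the previous reply:

```python
import time

def retrain_forever(load_latest_dataset, train_model, evaluate,
                    interval_seconds=24 * 60 * 60):
    """Re-train on a fixed schedule; the three callables are your own pipeline."""
    while True:
        dataset = load_latest_dataset()                   # fetch the newest snapshot
        train, dev, test = train_dev_test_split(dataset)  # re-split from scratch
        model = train_model(train, dev)
        evaluate(model, test)
        time.sleep(interval_seconds)  # once a day by default; adjust to your needs
```

In production you would more likely trigger this from cron or a workflow scheduler rather than a sleep loop, but the shape is the same: re-collect, re-split, re-train.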

Thanks @carloshvp for your answer! It makes perfect sense!
