Hi, I have a question about splitting a dataset into train/dev/test when the original dataset changes every day. This is very common when building recommender systems. Here is an example.
I have a dataset of user-product purchases, and I am trying to predict the next purchase of a user. Each row in the dataset is a pair (x, y), where x is the sequence of a user's past product purchases (IDs) plus other information about the user, and y is the next product ID. I have a specific neural architecture that can be trained for this task.
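For concreteness, here is a toy sketch of how I build these pairs (the names `purchase_log` and `make_examples` are just placeholders for my own pipeline):

```python
from typing import Dict, List, Tuple

# Toy purchase log: user -> product IDs in chronological order.
purchase_log: Dict[str, List[int]] = {
    "user_a": [101, 205, 307, 412],
    "user_b": [205, 101],
}

def make_examples(log: Dict[str, List[int]]) -> List[Tuple[List[int], int]]:
    """For each user, pair every prefix of the purchase history (x)
    with the product bought right after it (y)."""
    examples = []
    for user, products in log.items():
        for i in range(1, len(products)):
            examples.append((products[:i], products[i]))
    return examples

print(make_examples(purchase_log))
# [([101], 205), ([101, 205], 307), ([101, 205, 307], 412), ([205], 101)]
```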
The issue is that, depending on when you collect the data, the distribution can change, since users keep placing new purchases.
On day 1 I have different train/dev/test data than on day 2, and so on (unlike in vision tasks, where images do not change over time).
Hello @parin
If I understand the problem right: on a given day you have a dataset, and the next day you collect more data, so an updated dataset is available. I believe the best approach is to use your latest available dataset at training time, including all previous actions by users. Then randomly split it into train/dev/test and train the model. What matters is that the distributions across train/dev/test are not too different at any given training instance.
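A minimal sketch of the random split I have in mind, assuming scikit-learn and an 80/10/10 ratio (both just example choices):

```python
from sklearn.model_selection import train_test_split

# examples: all (x, y) pairs built from today's full snapshot
examples = [([101], 205), ([101, 205], 307), ([205], 101), ([307], 412)]

# Carve out a 10% test set first, then split the rest into train and dev;
# 1/9 of the remaining 90% is roughly 10% of the full dataset.
rest, test = train_test_split(examples, test_size=0.10, random_state=42)
train, dev = train_test_split(rest, test_size=1 / 9, random_state=42)
```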
The next day you will of course have the updated dataset. I think you should re-train on it, including the new data, and again randomly split it into train/dev/test subsets.
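Putting it together, the daily cycle could look something like this sketch; `fetch_snapshot` and `train_model` are hypothetical placeholders for your own data pipeline and training code:

```python
from sklearn.model_selection import train_test_split

def daily_update(fetch_snapshot, train_model, seed=42):
    # Re-snapshot: all (x, y) pairs collected up to today.
    examples = fetch_snapshot()
    # Re-split: fresh random train/dev/test over the full snapshot.
    rest, test = train_test_split(examples, test_size=0.10, random_state=seed)
    train, dev = train_test_split(rest, test_size=1 / 9, random_state=seed)
    # Re-train on the latest data, using dev for model selection.
    model = train_model(train, dev)
    return model, test
```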