Training model on simulated data

What is the recommended approach for splitting simulated data into train and test sets without violating ML principles such as data leakage and overfitting?

The simulated data mimics the real world under different scenarios. Each scenario is replicated a large number of times (1000+). A scenario has one or more probability distributions, each with its own unique parameters, and a replication is a random draw from the corresponding distribution with a unique seed.

I want to train the model under all possible scenarios so it performs well on the simulated test set as well as on real-world data when it's available. I am thinking of splitting the data at the seed level i.e. for each scenario, 80% of data having common seed goes into test set and the remaining 20% into test set. Data generated with the same seed is not shared between train and test. Any suggestions?
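A minimal sketch of the seed-level split described above, assuming the replications are tagged with (hypothetical) `scenarios` and `seeds` arrays and that scikit-learn is available. Splitting within each scenario keeps every scenario represented in both sets, while grouping by seed guarantees that no seed's draws leak across the split:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Hypothetical simulated dataset: each row is one replication,
# tagged with the scenario it belongs to and the seed that generated it.
n_rows = 5000
scenarios = rng.integers(0, 10, size=n_rows)   # 10 scenarios
seeds = rng.integers(0, 1000, size=n_rows)     # seed used for the draw
X = rng.normal(size=(n_rows, 4))               # features

train_idx, test_idx = [], []
for s in np.unique(scenarios):
    rows = np.where(scenarios == s)[0]
    # Group by seed so all rows sharing a seed land on the same side.
    gss = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=42)
    tr, te = next(gss.split(rows, groups=seeds[rows]))
    train_idx.extend(rows[tr])
    test_idx.extend(rows[te])
    # Within this scenario, no seed appears in both train and test.
    assert not set(seeds[rows[tr]]) & set(seeds[rows[te]])
```

The 80/20 proportion here applies to the seed *groups* within each scenario, not the raw rows, which is exactly what prevents same-seed leakage.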

Hi @raghavendrasri,
welcome to the community!
Just to understand your question better, you say:

I am thinking of splitting the data at the seed level i.e. for each scenario, 80% of data having common seed goes into test set and the remaining 20% into test set.

Do you mean

I am thinking of splitting the data at the seed level i.e. for each scenario, 80% of data having common seed goes into train set and the remaining 20% into test set.

?

Regarding the split size (80% and 20%): it sounds reasonable, but keep in mind that the right proportion depends on how much data you have in each scenario.
In the past, when datasets were small (10,000 samples or fewer), a usual split was 70% for the train set and 30% for the test set.
Nowadays, with much larger datasets (100,000 samples or more), it is common practice to assign a large portion of the data to the train set (90% or more) and the remainder to the test set.
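As a small illustration of why a smaller test fraction is acceptable at scale (scikit-learn assumed, data sizes are made up): with 100,000 samples, even a 10% test set still contains 10,000 samples, usually plenty for a stable error estimate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# A large hypothetical simulated dataset.
X = np.arange(100_000).reshape(-1, 1)

# A 10% test fraction still leaves thousands of test samples.
X_tr, X_te = train_test_split(X, test_size=0.1, random_state=0)
print(len(X_tr), len(X_te))  # 90000 10000
```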

Also take care that the simulated data is truly representative of the real world: an upside-down car, for example, does not exist in the real world.
Hope this can help
Regards