Data Splitting Strategy in Supervised ML

Dear Community,

Since I implemented a manual splitting strategy for my ML models, I have been getting much better results. So I asked myself whether this strategy is legitimate.

I can argue from the literature and from my data that my target variable has a linear relationship with one feature i. My strategy is therefore to sort the data by the value of this feature i, split the dataset into 100 subsets, and then assign 20% of each subset to the test data and the rest to the training data. Validation is applied with k-fold cross-validation on the test set.

What do you think about this approach? Thank you.

What is the purpose of splitting the dataset into 100 subsets? Are these subsets all the same size? If you’re just going to randomly pick 20% of each subset, why not just randomly pick 20% of the original set?

Thank you for your answer.

The purpose of this approach is to avoid completely different distributions of the target values between the test and training sets. With random splits I get wildly deviating distributions, which vary vastly in shape and mean and produce bad results.

Yes, in this approach I split the data into equally sized subsets, with the exception of the last subset.

Because we exploit this trend between the target and the specific feature i, the distributions of the test and training sets end up more similar than with a random split. This is because we always assign 20% of every subset to the test set, so we make use of a trend that changes only slightly within each subset.

In my opinion this approach is defensible if one can argue that this trend between the target and feature i also holds for other or unseen data.

Cool, thanks for the response.

I see what the issue is now: it’s that randomly splitting may cause the train and test splits to have a different distribution than the original data. This could indeed be a problem for smaller datasets or imbalanced datasets.

I think what you are doing with subsets might work, but I won’t know until I get more details (may need to actually see the implementation/code).

However, the usual way to get around this problem is to use the “stratify” parameter when splitting the data using sklearn’s train_test_split.

If you pass your labels to the stratify param, then it will make sure the distribution of labels across train/test split is equal (ex. same percentage of positives/negatives across train/test). If you’re looking to also include feature “i” in the stratification process, see this post on how to pass two columns into stratify.
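A minimal sketch of what that looks like, using a toy imbalanced label column (the column names and sizes here are made up for illustration, not taken from the original code):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy classification data: imbalanced labels (90% zeros, 10% ones).
df = pd.DataFrame({
    "feature_i": range(100),
    "label": [0] * 90 + [1] * 10,
})

# Without stratify, a small test set could easily miss the rare class.
# stratify=df["label"] preserves the label proportions in both splits.
train, test = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)

# Both splits now contain ~10% positives.
print(train["label"].mean(), test["label"].mean())
```

With 10 positives in 100 rows and a 20% test set, the stratified split places exactly 2 positives in the test set and 8 in the training set.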

Thank you so much.

I didn’t trust the tool yet because I thought it splits really randomly.

Here is how the code works. ‘number’ is the number of pieces the data is split into, i.e. 1% pieces of the total data size (I implemented it this way because I have separate data subsets taken from the comprehensive data set to run specific models). ‘size’ is the number of data points in each 1% subset, and ‘test_size’ is the fraction that sets the test set to 20%.

import pandas as pd

# Sort by the feature that trends with the target, so each consecutive
# slice covers a narrow range of r_dist values.
df_sorted = df.sort_values(by='r_dist')

# 'splitsize' is the fraction each piece should cover (0.01 for 1% pieces),
# so 'number' is the resulting count of subsets.
number = int(len(df_sorted) * splitsize)
size = len(df_sorted) // number
parts = [df_sorted.iloc[i * size:(i + 1) * size] for i in range(number)]

# Append any leftover rows to the last subset.
if number * size < len(df_sorted):
    parts[-1] = pd.concat([parts[-1], df_sorted.iloc[number * size:]])

TRAIN_DATA = pd.DataFrame()
TEST_DATA = pd.DataFrame()

for part in parts:
    # 'test_size' is a fraction (0.2), so multiply rather than floor-divide;
    # len(part) // test_size would request more rows than the part holds.
    test_part_size = int(len(part) * test_size)
    # Note: this takes the first rows of each sorted part, not a random
    # sample from within the part.
    TEST_DATA = pd.concat([TEST_DATA, part.iloc[:test_part_size]])
    TRAIN_DATA = pd.concat([TRAIN_DATA, part.iloc[test_part_size:]])

I read the code. Although I’m not sure whether it actually runs correctly or not, I think I get the general logic behind it.

Assuming r_dist is your “target” or “feature i”, then what you are effectively trying to do is stratified splitting. I recommend you use the “stratify” parameter of train_test_split instead of doing it manually, since that’s already widely used and battle tested (so less likely to have bugs).

Are you working on a regression problem (where r_dist can take a continuous range of real numbers) or a classification problem (where r_dist can take a limited number of different integers/categories)? For proper stratification, you should instead be breaking the subsets up based on possible r_dist values or bins, rather than by an arbitrary number (100 in your case).
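For the regression case, a minimal sketch of "bin the continuous values first, then stratify on the bins" could look like this (toy data; `pd.qcut` and the bin count of 10 are illustrative choices, not the original code):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy regression data: a continuous column can't be passed to stratify
# directly, but its quantile-bin labels can.
df = pd.DataFrame({"r_dist": rng.normal(size=1000)})

# Cut r_dist into 10 equal-frequency bins and stratify on the bin labels,
# so every bin contributes the same 80/20 proportion to train and test.
df["bin"] = pd.qcut(df["r_dist"], q=10, labels=False)
train, test = train_test_split(
    df, test_size=0.2, stratify=df["bin"], random_state=42
)
```

Each of the 10 bins holds 100 rows here, so after stratification every bin contributes exactly 20 rows to the test set; within each bin the 20 rows are picked randomly, unlike a sorted head-slice.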


“I didn’t trust the tool yet because I thought it splits really randomly.”

You need to learn to trust these tools; start by reading the documentation and practice, practice, practice applying the technique to the data until you are comfortable.

Also, I suggest reading Andrew Ng’s book “Machine Learning Yearning”. Chapters 5, 6, and 7 discuss strategies for splitting data. Your strategy should reflect how many observations you have and the evaluation metric you are trying to improve with your model.

I had a similar curiosity about splitting data a few months ago, with a complex classification problem I was working on. My strategy involved 70% for training, 10% for validation, and 20% for testing, with roughly 1,300 observations. I used an XGBoost classifier and fastML’s train_valid_test_split() function. The evaluation metric I was trying to improve was the F1 score, which originally started at 20%; I worked my way up to 96%, with precision at 89% on the test set.

You can view the code on my github space if curious.



Okay sorry, let me clarify:

  • I am working on a regression problem with ExtraTrees
  • my target is “j”, it is one column of my dataframe with continuous range of real values
  • r_dist is one of my features, which shows a trend together with the target

And now I know why I tried my own approach: I tried train_test_split from sklearn again, but it doesn’t work because my target cannot be divided into categories. Can you address this problem?

Seems a lot simpler to toss all your data into one pool and run several sessions using a purely random split, then average the weights you get from each training session.

I don’t understand what this has to do with how you decide to split the dataset. Can you explain further?

I have a huge data set, so training many scenarios, storing the weights, and taking the average sounds computationally expensive to me. Or do you mean something different?

My manual strategy, described above, improves my model’s predictions from 60% to 90% in Pearson correlation. That is why I became unsure: I started wondering whether my method might be illegitimate because, for reasons I don’t yet know, it always delivers good results. I just wanted to make sure I hadn’t made a serious mistake.

Ok that makes more sense now. I think there’s a bit of debugging to do here.

Basically what you’re doing by splitting into subsets is effectively binning (into 100 bins), and you’re applying stratified sampling to those bins. Overall, that approach does make sense. If you want to see how to bin and use stratify using sklearn tools, you can see this post.

I’m curious about the distribution of r_dist and j, though. You shouldn’t need stratified sampling unless those distributions are very skewed/multimodal. Can you plot their histograms and see what they look like? See this for how to do it. If they’re neither skewed nor multimodal, then there might be some other problem with your data processing or model.
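A quick way to eyeball the two distributions, assuming matplotlib is available (the normal samples below are placeholders; swap in the real `df["r_dist"]` and `df["j"]` columns):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove when plotting to screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Placeholder columns standing in for the real feature and target.
r_dist = rng.normal(loc=5.0, scale=1.0, size=2000)
j = rng.normal(loc=0.0, scale=2.0, size=2000)

# Side-by-side histograms: skew or multiple peaks would show up here.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
counts_r, _, _ = axes[0].hist(r_dist, bins=50)
axes[0].set_title("r_dist (feature)")
counts_j, _, _ = axes[1].hist(j, bins=50)
axes[1].set_title("j (target)")
fig.savefig("distributions.png")
```

If both panels look roughly unimodal and symmetric, a plain random split should already give similar train/test distributions.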

Also, in your other post you mention that you have a lot of data. How many samples are we talking about?

Okay, I simply tried train_test_split without stratification, and the two resulting distributions are almost perfectly similar, and my model’s performance is very good. Thank you! Sometimes I think the performance of the predictions is too good. Or is this normal for some models?

The quality of the predictions depends entirely on the complexity of the data set and how well your model works.

Can you tell me what the math behind this ‘train_test_split’ function is? I can’t find it.

Have you read this?


Cool, glad that you’re able to get a decent model.

I think you didn’t need stratified sampling after all, since your data wasn’t particularly skewed/multimodal.

I don’t have your code, but my guess is that whatever you did to split the train/test sets wasn’t random sampling.

For example, if you sort the data by any feature (ex. r_dist), and then you take the first 80% of samples as train, and the last 20% as test, then obviously the data distribution between train/test wouldn’t be correct.

What you need to do is randomly select 80% of the data into train, and the remainder into test.

If you want to implement something like this from scratch, you’d likely need to generate a random number for each sample, and then you can choose 80% of the largest randomly generated numbers into the train set, and the remainder in the test set. Or you could randomly shuffle the dataset, and then take the first 80% as train. Something like that. There are other ways as well, and you can probably google more on the subject if you’re interested.
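Both from-scratch variants described above can be sketched in a few lines (toy data; the 80/20 ratio and column names are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"r_dist": rng.normal(size=1000), "j": rng.normal(size=1000)})

# Variant 1: shuffle the whole frame, then take the first 80% as train
# and the remaining 20% as test.
shuffled = df.sample(frac=1.0, random_state=42)
cut = int(len(shuffled) * 0.8)
train = shuffled.iloc[:cut]
test = shuffled.iloc[cut:]

# Variant 2: draw one uniform random number per row and send the rows
# with the largest 80% of numbers to train, the rest to test.
u = rng.random(len(df))
threshold = np.quantile(u, 0.2)
train2 = df[u >= threshold]
test2 = df[u < threshold]
```

Either way, every row ends up in exactly one of the two sets, and neither set depends on the sort order of any feature.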