Since I implemented a manual splitting strategy for my ML models, I have been getting much better results, so I am asking myself whether this strategy is legitimate.

I can argue from literature and data that my target variable has a linear relation to one feature i. My strategy is therefore to sort the data by this feature, split the sorted dataset into 100 subsets, and then assign 20% of each subset to the test data and the rest to the training data. Validation is done with k-fold cross-validation on the test set.

What is the purpose of splitting the dataset into 100 subsets? Are these subsets all the same size? If you’re just going to randomly pick 20% of each subset, why not just randomly pick 20% of the original set?

The purpose of this approach is to avoid completely different distributions of the target values between the test and training sets. With random splits I get distributions that differ widely in shape and mean, which produces bad results.

Yes, in this approach I am splitting the data into equally sized subsets, with the exception of the last one.

Because we use this trend between the target and the specific feature i, the distributions of the test and training sets are more similar than with a random split. This is because we always assign 20% of every subset to the test set, and the trend changes only slightly within each subset.

In my opinion this approach is justified if one can argue that the trend between the target and feature i also holds for other, unseen data.

I see what the issue is now: it’s that randomly splitting may cause the train and test splits to have a different distribution than the original data. This could indeed be a problem for smaller datasets or imbalanced datasets.

I think what you are doing with subsets might work, but I won’t know until I get more details (I may need to actually see the implementation/code).

However, the usual way to get around this problem is to use the “stratify” parameter when splitting the data using sklearn’s train_test_split.

If you pass your labels to the stratify param, then it will make sure the distribution of labels across the train/test split is equal (e.g. same percentage of positives/negatives across train/test). If you’re looking to also include feature “i” in the stratification process, see this post on how to pass two columns into stratify.
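
A minimal sketch of what passing labels to stratify looks like, assuming a binary classification label. The toy data and variable names are made up for illustration and are not taken from the thread.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 1000 samples, 10% positives
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 100 + [0] * 900)

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fraction of positives in each split
print(y_train.mean(), y_test.mean())
```

Note that stratify expects discrete classes, so a continuous target or feature has to be binned first.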

I didn’t trust the tool yet because I thought it splits really randomly.

Here is how the code works: “number” is the number of pieces the data is split into, each piece being 1% of the total data size (I implemented it this way because I have separated data subsets from the comprehensive dataset to run specific models), “size” is the number of data points in each 1% subset, and “test_size” is the fraction that sets the test set to 20%.

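import pandas as pd

# df, splitsize and test_size are defined elsewhere; per the description, test_size = 0.2
df_sorted = df.sort_values(by='r_dist')
number = int(len(df) * splitsize)  # number of pieces
size = len(df_sorted) // number    # data points per piece
parts = [df_sorted.iloc[i * size:(i + 1) * size] for i in range(number)]
# Append any leftover rows to the last piece
if number * size < len(df_sorted):
    parts[-1] = pd.concat([parts[-1], df_sorted.iloc[number * size:]])
test_parts, train_parts = [], []
for part in parts:
    # test_size is a fraction, so multiply by it (the original floor-divided by it)
    test_part_size = int(len(part) * test_size)
    # Note: this takes the lowest-r_dist rows of each piece, not a random sample
    test_parts.append(part.iloc[:test_part_size])
    train_parts.append(part.iloc[test_part_size:])
TEST_DATA = pd.concat(test_parts)
TRAIN_DATA = pd.concat(train_parts)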

I read the code. Although I’m not sure whether it actually runs correctly or not, I think I get the general logic behind it.

Assuming r_dist is your “target” or “feature i”, then what you are effectively trying to do is stratified splitting. I recommend you use the “stratify” parameter of train_test_split instead of doing it manually, since that’s already widely used and battle-tested (so less likely to have bugs).

Are you working on a regression problem (where r_dist can be a continuous range of real numbers) or a classification problem (where r_dist can take a limited number of different integers/categories)? For proper stratification, you should instead be breaking up the subsets based on possible r_dist values or bins, rather than just breaking them up by an arbitrary number (100 in your case).

“I didn’t trust the tool yet because I thought it splits really randomly.”

You need to learn to trust these tools; start by reading the documentation and practice, practice, practice applying the technique to the data until you are comfortable.

Also, I suggest reading Andrew Ng’s book “Machine Learning Yearning”. Chapters 5, 6, and 7 discuss strategies for splitting data. Your strategy should reflect how many observations you have and the evaluation metric you are trying to improve with your model.

I had a similar curiosity around splitting data a few months ago, with a complex classification problem I was working on. My strategy for splitting data involved 70% for training, 10% for validation, and 20% for testing, with roughly 1,300 observations. I used an XGBoost classifier and fastML’s train_valid_test_split() function. The evaluation metric I was trying to improve was the F1 score, which originally started at 20%; I worked my way up to 96%, with precision at 89%, on the test set.
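
For reference, a 70/10/20 split like the one described above can also be sketched with two calls to sklearn’s train_test_split. This is a stand-in, not the fastML function mentioned in the post, and the toy data is hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 1300 observations with binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(1300, 4))
y = rng.integers(0, 2, size=1300)

# First hold out 20% of the data as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
# Then take 10% of the full data (0.125 of the remaining 80%) as validation
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.125, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_valid), len(X_test))
```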

You can view the code on my github space if curious.

I am working on a regression problem with ExtraTrees.

My target is “j”; it is one column of my dataframe with a continuous range of real values.

r_dist is one of my features, which shows a trend together with the target.

And now I know why I tried my approach: I tried train_test_split from sklearn again, but stratifying doesn’t work because my target cannot be divided into categories. Can you address this problem?

Seems a lot simpler if you toss all your data into one pool, and run several sessions using a purely random split. Average the weights you get from training on each session.

I don’t understand what this has to do with how you decide to split the dataset. Can you explain further?

I have a huge dataset… training many scenarios, storing weights, and taking the average sounds computationally expensive to me. Or do you mean something different?

My manual strategy, provided above, improves my model’s predictions from 60% to 90% in terms of Pearson correlation. This was the reason I became unsure. I raised the question because I was wondering whether my method might be invalid since, for reasons I don’t yet know, it always delivers good results. I just wanted to make sure that I didn’t make a serious mistake.

Ok, that makes more sense now. I think there’s a bit of debugging to do here.

Basically what you’re doing by splitting into subsets is effectively binning (into 100 bins), and you’re applying stratified sampling to those bins. Overall, that approach does make sense. If you want to see how to bin and use stratify using sklearn tools, you can see this post.
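
As a rough sketch of binning plus stratify for a continuous variable: the column names r_dist and j follow the thread, but the toy data, the use of pd.qcut, and the choice of 100 equal-frequency bins are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): target j roughly linear in feature r_dist
rng = np.random.default_rng(0)
r_dist = rng.exponential(scale=2.0, size=5000)
df = pd.DataFrame({"r_dist": r_dist, "j": 3.0 * r_dist + rng.normal(size=5000)})

# Cut r_dist into 100 equal-frequency bins; labels=False returns bin indices
bins = pd.qcut(df["r_dist"], q=100, labels=False, duplicates="drop")

# Randomly draw 20% of every bin for the test set
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=bins, random_state=0
)

# The target means of the two splits should be close
print(train_df["j"].mean(), test_df["j"].mean())
```

Unlike taking the first rows of each sorted subset, this samples at random within each bin, so the test rows are not systematically the lowest r_dist values of their bin.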

I’m curious about the distribution of r_dist and j, though. You shouldn’t need stratified sampling unless those distributions are very skewed/multimodal. Can you plot their histograms and see how they look? See this for how to do this. If they’re neither skewed nor multimodal, then there might be some other problem with your data processing/model.
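
If plotting is inconvenient, a quick plot-free check along the same lines is to print histogram counts and the sample skewness. This is only a sketch: the toy column below is a hypothetical, deliberately skewed stand-in for r_dist.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical): a right-skewed variable standing in for r_dist
rng = np.random.default_rng(0)
s = pd.Series(rng.exponential(scale=2.0, size=5000), name="r_dist")

# Histogram counts over 10 equal-width bins
counts, edges = np.histogram(s, bins=10)
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:7.2f} to {right:7.2f}: {count}")

# Sample skewness: near 0 for symmetric data, clearly positive for this toy column
skewness = s.skew()
print(skewness)
```

Several separated peaks in the counts would hint at a multimodal distribution.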

Also, in your other post, you mention that you have a lot of data. How many samples are you talking about?

Okay, I simply tried train_test_split without the stratify option, and the two resulting distributions are almost perfectly similar and my model’s performance is very, very good. Thank you! Sometimes I think the performance of the predictions is too good, or is this normal for some models?

Cool, glad that you’re able to get a decent model.

I think you didn’t need stratified sampling after all, since your data wasn’t particularly skewed/multimodal.

I don’t have your code, but my guess is that whatever you did to split the train/test sets wasn’t random sampling.

For example, if you sort the data by any feature (e.g. r_dist), and then you take the first 80% of samples as train and the last 20% as test, then obviously the data distribution between train/test wouldn’t be correct.

What you need to do is randomly select 80% of the data into train, and the remainder into test.

If you want to implement something like this from scratch, you’d likely need to generate a random number for each sample, and then you can put the 80% of samples with the largest random numbers into the train set and the remainder into the test set. Or you could randomly shuffle the dataset and then take the first 80% as train. Something like that. There are other ways as well, and you can probably google more on the subject if you’re interested.
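
The shuffle-then-slice variant could look roughly like this, as a sketch only. The toy dataframe is hypothetical and is sorted by r_dist on purpose, to show that the shuffle removes the ordering.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), deliberately sorted by r_dist
df = pd.DataFrame({"r_dist": np.arange(1000, dtype=float)})

# Shuffle the row positions, then take the first 80% as train
rng = np.random.default_rng(0)
shuffled = rng.permutation(len(df))
n_train = int(0.8 * len(df))
train_df = df.iloc[shuffled[:n_train]]
test_df = df.iloc[shuffled[n_train:]]

# Both means should be near the overall mean despite the sorted input
print(train_df["r_dist"].mean(), test_df["r_dist"].mean())
```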