Dataset split using unique patient IDs

Good morning everyone,

Excuse my simple question, as I am new to the coding world.

I am trying to work on the original dataset CXR-8 on my own,
but I don’t know how to split the dataset into train/valid/test based on unique patient IDs.
Can anyone please help me with suitable code?
Many thanks in advance.

You can use scikit-learn for train/test splitting.

from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)

You can see the inputs for this API at this link, or you can use ImageDataGenerator from tensorflow (tf.keras.preprocessing.image.ImageDataGenerator).

So what about the validation set?
How do I create it? Shall I add it to the code,
as train, valid, test = train_valid_test_split or something like this?

As said by @me_sajied, the train_test_split in scikit-learn will work.

If you need train, test, and validation datasets, split the test data in two with a second train_test_split call on the next line, setting test_size appropriately.

Put more simply:

First split the data into two parts, train and test_sample. Now you have two datasets.

Keep train as it is. On the next line, split test_sample into two, test and valid. This gives you three datasets. Set test_size accordingly in each call.
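
The two-step split described above could be sketched roughly as follows. The toy dataframe, its column names, and the 80/10/10 ratios are assumptions made for illustration, not the real CXR-8 table. Note that a row-wise split like this does not by itself keep all images of one patient in the same set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real table; the column names are assumptions.
df = pd.DataFrame({
    "Image Index": [f"img_{i:03d}.png" for i in range(100)],
    "Patient ID": [i // 4 + 1 for i in range(100)],
})

# Step 1: keep 80% of the rows for training, hold out 20% as test_sample.
train, test_sample = train_test_split(df, test_size=0.2, random_state=0)

# Step 2: split the held-out rows in half, giving 10% valid and 10% test.
valid, test = train_test_split(test_sample, test_size=0.5, random_state=0)

print(len(train), len(valid), len(test))
```

The second test_size is a fraction of test_sample, not of the full dataframe, which is why 0.5 yields two equal halves of the held-out 20%.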

Thank you very much, @bharathikannan, for the clarification. That’s really helpful :smiley:

One approach applies if your unique patient ID is an integer that can also be used as an index into a list. The other approach is to load the data into a DataFrame and use conditions that match on the unique patient ID to split your data. Pandas is very useful in that regard.

Many thanks, @sbansal793, for your reply.
Can you please elaborate?
Yes, the patient IDs are integers: serial numbers from 1 to 1335,
and each ID is repeated 3 times on average.
I used train_test_split from scikit-learn
and then removed the patients that overlapped between the train, test, and valid sets,
but that ended up losing many samples (more than 1,000).
So if you could tell me more about using conditions in pandas, I would be very thankful.

Can you post a screenshot of your dataset? It will give me a clearer picture. Are the IDs used as the names of the chest X-ray images? How are you loading the dataset?

Yes, of course. I highlighted the Patient ID and Image Index columns.
The images have the same names as the Image Index values.

I am really thankful for your support and help.

First create an array of the patient IDs from 1 to 1335.
Shuffle the array.
Split the array into train, test, and val IDs.
train_dataframe = df[df['Patient ID'].isin(train_id)]
Do the same for test and val.
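
For anyone reading later, those steps might look something like this. It is only a sketch: the toy dataframe (1335 patients with exactly 3 images each), the fixed seed, and the 80/10/10 ratios are assumptions, and names such as valid_id and test_id are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in: 1335 patients with 3 images each (assumed layout).
patient_ids = np.arange(1, 1336)
df = pd.DataFrame({"Patient ID": np.repeat(patient_ids, 3)})
df["Image Index"] = [f"img_{i:05d}.png" for i in range(len(df))]

# Shuffle the unique IDs with a fixed seed so the split is reproducible.
rng = np.random.default_rng(0)
shuffled = rng.permutation(patient_ids)

# Split the IDs (not the rows) 80/10/10; the ratios are an assumption.
n_train = int(0.8 * len(shuffled))
n_valid = int(0.1 * len(shuffled))
train_id = shuffled[:n_train]
valid_id = shuffled[n_train:n_train + n_valid]
test_id = shuffled[n_train + n_valid:]

# Select every row whose patient falls in the corresponding ID set.
train_df = df[df["Patient ID"].isin(train_id)]
valid_df = df[df["Patient ID"].isin(valid_id)]
test_df = df[df["Patient ID"].isin(test_id)]
```

Because the IDs are split before any rows are selected, no samples have to be thrown away afterwards to remove overlapping patients.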

Cool that you’re trying to work on this in addition to the course work :slight_smile:
I don’t want to add more confusion if a solution has already been found XD
But one more option would be sklearn’s group split, GroupShuffleSplit (sklearn.model_selection.GroupShuffleSplit, scikit-learn 1.0 documentation).
It splits the data according to the defined groups (in this case, patient ID) so that no group is separated across different splits.
Have a look here for a working example of how to use this class (and other options for splitting your data):
3.1. Cross-validation: evaluating estimator performance (scikit-learn 1.0 documentation)
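
As a rough sketch of how this could give train, valid, and test sets, GroupShuffleSplit can be applied twice. The toy dataframe, the column names, and the 80/10/10 proportions are assumptions. The proportions apply to patients (groups), so the row counts are only approximately 80/10/10 when patients have different numbers of images.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in: 100 patients with 3 images each (assumed layout).
df = pd.DataFrame({"Patient ID": np.repeat(np.arange(1, 101), 3)})
df["Image Index"] = [f"img_{i:04d}.png" for i in range(len(df))]

# First pass: hold out roughly 20% of the patients.
outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, rest_idx = next(outer.split(df, groups=df["Patient ID"]))
train_df, rest_df = df.iloc[train_idx], df.iloc[rest_idx]

# Second pass: divide the held-out patients between valid and test.
inner = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
valid_idx, test_idx = next(inner.split(rest_df, groups=rest_df["Patient ID"]))
valid_df, test_df = rest_df.iloc[valid_idx], rest_df.iloc[test_idx]
```

split() yields positional indices, which is why the rows are selected with iloc rather than loc.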


Thank you @sbansal793,
I just tried it and indeed it worked for me. :muscle: :smiley:
It’s really nice code. I understood it easily. :smiley:

Many thanks for your help and support.

Many thanks, @margaridacosta. I come from a medical background (pharmacist) :smiley:
But I do love ML and its applications in healthcare; that’s why I am here now :grinning:

I tried using GroupShuffleSplit. It seems a very elegant solution, but I could not get it to work.

My whole dataframe is df,
and I want to split it by Patient ID.
I tried this code:
from sklearn.model_selection import GroupShuffleSplit
gss = GroupShuffleSplit(n-split = 5 , test_size = 0.2 , random_state = 0)

for train , test in gss.split( df , groups = ‘Patient ID’):
print ("%s s " (train,test))

and I got an error.

If you can help me, I would really appreciate it.
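
Looking at the snippet above, three things seem likely to cause errors: the parameter is spelled n_splits (n-split is not valid Python), groups expects an array-like with one label per row rather than the column name as a string, and the print call is missing the % operator between the format string and the tuple. A corrected sketch, using a toy dataframe as an assumed stand-in for df:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for df: 50 patients with 3 images each (assumed layout).
df = pd.DataFrame({"Patient ID": np.repeat(np.arange(1, 51), 3)})

# The parameter name is n_splits, with an underscore and a trailing "s".
gss = GroupShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

splits = []
# groups takes the column of labels itself, not the string 'Patient ID'.
for train, test in gss.split(df, groups=df["Patient ID"]):
    # The % operator joins the format string and the tuple of values.
    print("%s %s" % (train.shape, test.shape))
    splits.append((train, test))
```

Each iteration yields a different random train/test partition of the row positions. For a single split, n_splits=1 with next(gss.split(...)) would be enough.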

Hello @Youstina.Ghoris

Can you explain why you want to split your dataset based on patient ID?

Since there are many unique patient IDs, splitting on them produces many small datasets, and we need to store all of them. A convenient way to store them is as key-value pairs in a dictionary.

Try dict(tuple(df.groupby('patient_id'))) for splitting based on patient ID.

Use len(pd.unique(df['patient_id'])) to see how many unique patient IDs there are, and use df.groupby('patient_id').count() to see how many rows there are for each patient ID. Then decide whether you really want to split the dataset this way.
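
A small sketch of those calls on made-up data (the dataframe and its image_index column are assumptions for illustration). It uses size() in place of count() for the per-patient row counts, since size() returns one number per patient, while count() returns the non-null count for every column.

```python
import pandas as pd

# Made-up data: three patients with 2, 1, and 3 images respectively.
df = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3, 3],
    "image_index": ["a.png", "b.png", "c.png", "d.png", "e.png", "f.png"],
})

# One dictionary: patient ID -> DataFrame holding that patient's rows.
per_patient = dict(tuple(df.groupby("patient_id")))

# Number of distinct patients.
n_patients = len(pd.unique(df["patient_id"]))

# Number of rows (images) for each patient.
rows_per_patient = df.groupby("patient_id").size()

print(n_patients)
print(rows_per_patient)
```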

Will dict(tuple(df.groupby('patient_id'))) return a list of dictionaries?

@sbansal793 No, it will return a single dictionary in which each key is a patient ID and each value holds the corresponding rows for that patient ID.

Hello @bharathikannan,
Many thanks for your reply. I really learned a lot from those lines of code.

I want to do this to avoid the data leakage that could occur if images of the same patient appear in both the test and train datasets. Because they come from the same patient, the model will most likely recognize them easily, so the test results would be overoptimistic.
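
One simple way to check for this kind of leakage is to look for patient IDs shared between two splits. The helper name shared_patients and the tiny dataframes below are hypothetical, just to show the idea:

```python
import pandas as pd

def shared_patients(a, b, column="Patient ID"):
    """Return the set of patient IDs that appear in both DataFrames."""
    return set(a[column]) & set(b[column])

# Made-up splits in which patient 3 appears on both sides.
train_df = pd.DataFrame({"Patient ID": [1, 1, 2, 3]})
test_df = pd.DataFrame({"Patient ID": [3, 4, 4]})

leaky = shared_patients(train_df, test_df)
print(leaky)
```

An empty set for every pair of splits (train/valid, train/test, valid/test) means no patient is shared between them.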
