Synthetic Data for a Healthcare Problem (Binary Classification)

Does anyone have experience creating synthetic data to test model performance?

I would be grateful if anyone could share their experience with effective tools and approaches before I dive down a rabbit hole! Cheers! :v:

What’s the problem you’re working on?

Healthcare problem, binary classification.

I have a dataset of 1,300 patients who received treatment for opioid use disorder. The treatment lasts for 6 months, so I have 6 months of clinical data, including drug tests, medication doses, and surveys of self-reported use. I’ve also created outcome features, including meeting a predefined abstinence window, total negative tests, and total consecutive negative tests. Abstinence is the primary measure of clinical benefit.

I trained an XGBoost model with an F1 score of 96% and precision of 89%.

I have a very small test set, about 200 observations. I would be interested to see how the model performs on more test data.
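Since the held-out set is so small, here is roughly how I could estimate F1 and precision with stratified cross-validation instead of a single split. This is only a sketch; the file name and column names (e.g. `abstinent`) are placeholders, not my actual schema.

```python
# Sketch: more stable F1/precision estimates on ~1,300 patients using
# stratified k-fold CV rather than a single ~200-row test set.
# File/column names below are placeholders.
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_validate
from xgboost import XGBClassifier

df = pd.read_csv("oud_treatment_features.csv")      # hypothetical feature table
X = df.drop(columns=["abstinent"])                  # engineered features
y = df["abstinent"]                                 # 1 = met the abstinence window

model = XGBClassifier(n_estimators=300, max_depth=4,
                      learning_rate=0.05, eval_metric="logloss")

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(model, X, y, cv=cv, scoring=["f1", "precision"])

print(f"F1:        {scores['test_f1'].mean():.3f} ± {scores['test_f1'].std():.3f}")
print(f"Precision: {scores['test_precision'].mean():.3f} ± {scores['test_precision'].std():.3f}")
```

Five folds on ~1,300 patients gives roughly 260 held-out points per fold plus a spread across folds, but it still isn't genuinely new data, which is why I'm looking at synthetic data.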

Cheers

Please consider the Deep Learning Specialization to learn about model analysis and generating synthetic data. See the outlines for Courses 2 and 3 (for model analysis) and Course 4 (for generating synthetic data).

Contact the mentors of the Machine Learning Specialization / AI for Medicine Specialization for pointers.

Good luck.

Hi Balaji,

I took the Deep Learning Specialization last year. I can’t recall anything that covered generating synthetic data. I looked into the notebooks for the lessons you mentioned; they cover NLP and bidirectional neural networks. What specifically are you referring to?

I am looking for a more practical, real-world approach, not so much academic or theory-based, if possible.

Cheers

Good to know that you took the DLS. See the Course 4, Week 2, Assignment 2 lab for details on augmentations performed using a Sequential object.
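If it helps, the augmentation in that lab is built from Keras preprocessing layers inside a Sequential object; a minimal sketch (not the exact lab code) looks like this:

```python
# Sketch of image augmentation with a Keras Sequential object,
# similar in spirit to the DLS Course 4, Week 2 lab (not the exact lab code).
import tensorflow as tf

data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),   # random left/right flips
    tf.keras.layers.RandomRotation(0.2),        # random rotations up to ±20% of a full turn
])

# The augmentation is applied on the fly, so each epoch sees slightly different images.
inputs = tf.keras.Input(shape=(160, 160, 3))
x = data_augmentation(inputs)
x = tf.keras.layers.Rescaling(1.0 / 255)(x)
x = tf.keras.layers.Conv2D(32, 3, activation="relu")(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
```

Keep in mind these layers operate on images; augmenting tabular clinical data would call for a different approach.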

Questions:

  1. What’s the Bayes (best achievable) performance on the metrics of interest?
  2. Why is the test split only ~15% of the available data? Not that the split ratio is carved in stone, but any of the split ratios Andrew recommends for creating train / test sets when you have < 10K data points should give you more test data points.
  3. Are the training and test sets coming from the same distribution?

It’s important to understand the feature distributions of the training data to pin down the method for synthetic data generation.
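For question 3, a quick sanity check is a per-feature two-sample Kolmogorov–Smirnov test between the train and test splits, along these lines (the feature and file names are only placeholders for whatever you engineered):

```python
# Sketch: do train and test look like they come from the same distribution?
# Feature/file names are placeholders.
import pandas as pd
from scipy.stats import ks_2samp

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

features = ["total_negative_tests", "consecutive_negative_tests", "median_dose_mg"]
for col in features:
    stat, p_value = ks_2samp(train[col], test[col])
    verdict = "possible shift" if p_value < 0.01 else "looks similar"
    print(f"{col:30s} KS={stat:.3f}  p={p_value:.4f}  -> {verdict}")
```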

I looked at the Course 4, Week 2 assignment.

That random data generation is specific to time series and to the exercise in the notebook. It is not appropriate for the project I am working on.

A bit of a waste of time chasing this down.

From the cursory research I’ve done, I’m going to explore Ydata.ai. I also became aware of the Ydata GitHub page, which discusses practical approaches and the different models used for synthetic data creation.

Time series data is covered in the 4th course of the TensorFlow Developer Professional Certificate. Course 2 does cover CNN-related augmentations in the labs.

The Deep Learning Specialization covers data augmentation and model analysis.

Good luck with your exploration.

Can you screenshot or share the dataset format? I’m wondering whether Generative Adversarial Networks (GANs) or the Python Faker library would be a good fit for generating that.
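To show what I mean by the second option, a sketch like the one below would only fabricate rows in the same format as your dataset (every column name and value range here is a guess); it will not reproduce the real distributions or correlations between features, so it is useful for schema and pipeline testing rather than for measuring model performance:

```python
# Sketch: fabricate patient-shaped rows with Faker + NumPy.
# All column names and value ranges are guesses about the dataset's format.
import numpy as np
import pandas as pd
from faker import Faker

fake = Faker()
rng = np.random.default_rng(0)
n = 1_000

synthetic = pd.DataFrame({
    "patient_id": [fake.uuid4() for _ in range(n)],                 # fake identifiers
    "total_negative_tests": rng.integers(0, 25, size=n),            # drug tests over 6 months
    "consecutive_negative_tests": rng.integers(0, 25, size=n),
    "median_dose_mg": rng.normal(60, 20, size=n).clip(0).round(1),  # medication dose
    "abstinent": rng.integers(0, 2, size=n),                        # binary outcome label
})
print(synthetic.head())
```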


I think you’re on the right track; I do believe GANs are one interesting solution. Unfortunately, I don’t have time to research sequence models now! ;(

I used a commercial off-the-shelf solution called ydata.ai.

I created a dataset with 100,000 observations. That’s the limit of the freemium plan; if you want more observations or more than 30 features, you have to pay for it.

The data did not perform well; I saw an additional 20% drop in precision for positive treatment outcomes. The synthetic data used a very different distribution for some of the features, which I believe is what led to the overfitting.
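For reference, the comparison boils down to something like this (the file and column names are placeholders for my actual artifacts):

```python
# Sketch: score the trained model on the real held-out test set and on the
# ydata.ai export, then compare precision for the positive (abstinent) class.
# File/column names are placeholders.
import joblib
import pandas as pd
from sklearn.metrics import precision_score

model = joblib.load("xgb_oud_model.joblib")          # previously trained XGBoost model
real = pd.read_csv("test.csv")                       # ~200 real held-out patients
synth = pd.read_csv("ydata_synthetic_100k.csv")      # synthetic export

for name, df in [("real test set", real), ("synthetic set", synth)]:
    X_eval, y_eval = df.drop(columns=["abstinent"]), df["abstinent"]
    print(f"{name}: precision = {precision_score(y_eval, model.predict(X_eval)):.3f}")
```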

That was my first crack at synthetic data. I will keep investigating techniques to get better quality in the output.

Cheers