Synthetic Data for a Healthcare Problem (Binary Classification)

Does anyone have experience creating synthetic data to test model performance?

I would be grateful if anyone could share their experience with effective tools and approaches before I dive down a rabbit hole! Cheers! :v:

What’s the problem you’re working on?

Healthcare problem, binary classification.

I have a dataset of 1,300 patients who received treatment for opioid use disorder. The treatment lasts for 6 months, so I have 6 months of clinical data, including drug tests, medication doses, and surveys of self-reported use. I’ve also created outcome features, including meeting a predefined abstinence window, total negative tests, and total consecutive negative tests. Abstinence is the primary measure of clinical benefit.

I trained an XGBoost model with an F1 score of 96% and precision of 89%.

I have a very small test set, about 200 observations. I would be interested to see how the model performs on more test data.
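Since the held-out set is so small, here is roughly how I could estimate F1 and precision with stratified cross-validation instead of a single split. This is only a sketch; the file name and column names (e.g. `abstinent`) are placeholders, not my actual schema.

```python
# Sketch: more stable F1/precision estimates on ~1,300 patients using
# stratified k-fold CV rather than a single ~200-row test set.
# File/column names below are placeholders.
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_validate
from xgboost import XGBClassifier

df = pd.read_csv("oud_treatment_features.csv")      # hypothetical feature table
X = df.drop(columns=["abstinent"])                  # engineered features
y = df["abstinent"]                                 # 1 = met the abstinence window

model = XGBClassifier(n_estimators=300, max_depth=4,
                      learning_rate=0.05, eval_metric="logloss")

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(model, X, y, cv=cv, scoring=["f1", "precision"])

print(f"F1:        {scores['test_f1'].mean():.3f} ± {scores['test_f1'].std():.3f}")
print(f"Precision: {scores['test_precision'].mean():.3f} ± {scores['test_precision'].std():.3f}")
```

Five folds on ~1,300 patients gives roughly 260 held-out points per fold plus a spread across folds, but it still isn't genuinely new data, which is why I'm looking at synthetic data.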

Cheers

Please consider the Deep Learning Specialization to learn about model analysis and generating synthetic data. See the outlines for Courses 2 and 3 (for model analysis) and Course 4 (for generating synthetic data).

Contact the mentors of the Machine Learning Specialization / AI for Medicine Specialization for pointers.

Good luck.

Hi Balaji,

I took the Deep Learning Specialization last year. I can’t recall anything that covered generating synthetic data. I looked into the notebooks for the lessons you mentioned; they cover NLP and bidirectional neural networks. What specifically are you referring to?

I am looking for a more practical, real-world approach, not so much academic or theory-based, if possible.

Cheers

Good to know that you took the DLS. See the Course 4, Week 2, Assignment 2 lab for details on augmentations performed using a Sequential object.
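If it helps, the augmentation in that lab is built from Keras preprocessing layers inside a Sequential object; a minimal sketch (not the exact lab code) looks like this:

```python
# Sketch of image augmentation with a Keras Sequential object,
# similar in spirit to the DLS Course 4, Week 2 lab (not the exact lab code).
import tensorflow as tf

data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),   # random left/right flips
    tf.keras.layers.RandomRotation(0.2),        # random rotations up to ±20% of a full turn
])

# The augmentation is applied on the fly, so each epoch sees slightly different images.
inputs = tf.keras.Input(shape=(160, 160, 3))
x = data_augmentation(inputs)
x = tf.keras.layers.Rescaling(1.0 / 255)(x)
x = tf.keras.layers.Conv2D(32, 3, activation="relu")(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
```

Keep in mind these layers operate on images; augmenting tabular clinical data would call for a different approach.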

Questions:

  1. What’s the Bayes (best achievable) performance on the metrics of interest?
  2. Why is the test split only ~15% of the available data? Not that the split ratio is carved in stone, but any of the split ratios Andrew recommends for creating train / test sets when you have < 10K data points should give you more test data points.
  3. Are the training and test sets coming from the same distribution?

It’s important to understand the feature distributions of the training data to pin down the method for synthetic data generation.
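For question 3, a quick sanity check is a per-feature two-sample Kolmogorov–Smirnov test between the train and test splits, along these lines (the feature and file names are only placeholders for whatever you engineered):

```python
# Sketch: do train and test look like they come from the same distribution?
# Feature/file names are placeholders.
import pandas as pd
from scipy.stats import ks_2samp

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

features = ["total_negative_tests", "consecutive_negative_tests", "median_dose_mg"]
for col in features:
    stat, p_value = ks_2samp(train[col], test[col])
    verdict = "possible shift" if p_value < 0.01 else "looks similar"
    print(f"{col:30s} KS={stat:.3f}  p={p_value:.4f}  -> {verdict}")
```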

I looked at the Course 4, Week 2 assignment.

That random data generation is specific to time series and to the exercise in the notebook. It is not appropriate for the project I am working on.

A bit of a waste of time chasing this down.

From the cursory research I’ve done, I’m going to explore Ydata.ai. I also became aware of the Ydata GitHub page, which discusses practical approaches and the different models used for synthetic data creation.

Time series data is covered in the 4th course of the TensorFlow Developer Professional Certificate. Course 2 does cover CNN-related augmentations in the labs.

The Deep Learning Specialization covers data augmentation and model analysis.

Good luck with your exploration.

Can you screenshot or share the dataset format? I’m wondering whether Generative Adversarial Networks (GANs) or the Python Faker library would be a good fit for generating that.
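To show what I mean by the second option, a sketch like the one below would only fabricate rows in the same format as your dataset (every column name and value range here is a guess); it will not reproduce the real distributions or correlations between features, so it is useful for schema and pipeline testing rather than for measuring model performance:

```python
# Sketch: fabricate patient-shaped rows with Faker + NumPy.
# All column names and value ranges are guesses about the dataset's format.
import numpy as np
import pandas as pd
from faker import Faker

fake = Faker()
rng = np.random.default_rng(0)
n = 1_000

synthetic = pd.DataFrame({
    "patient_id": [fake.uuid4() for _ in range(n)],                 # fake identifiers
    "total_negative_tests": rng.integers(0, 25, size=n),            # drug tests over 6 months
    "consecutive_negative_tests": rng.integers(0, 25, size=n),
    "median_dose_mg": rng.normal(60, 20, size=n).clip(0).round(1),  # medication dose
    "abstinent": rng.integers(0, 2, size=n),                        # binary outcome label
})
print(synthetic.head())
```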


I think you’re on the right track; I do believe GANs are one interesting solution. Unfortunately, I don’t have time to research sequence models now! ;(

I used a commercial off-the-shelf solution called ydata.ai.

I created a dataset with 100,000 observations. That’s the limit of the freemium plan; if you want more observations or more than 30 features, you have to pay for it.

The data did not perform well; I saw an additional 20% drop in precision for positive treatment outcomes. The synthetic data used a very different distribution for some of the features, which I believe is what led to the overfitting.
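For reference, the comparison boils down to something like this (the file and column names are placeholders for my actual artifacts):

```python
# Sketch: score the trained model on the real held-out test set and on the
# ydata.ai export, then compare precision for the positive (abstinent) class.
# File/column names are placeholders.
import joblib
import pandas as pd
from sklearn.metrics import precision_score

model = joblib.load("xgb_oud_model.joblib")          # previously trained XGBoost model
real = pd.read_csv("test.csv")                       # ~200 real held-out patients
synth = pd.read_csv("ydata_synthetic_100k.csv")      # synthetic export

for name, df in [("real test set", real), ("synthetic set", synth)]:
    X_eval, y_eval = df.drop(columns=["abstinent"]), df["abstinent"]
    print(f"{name}: precision = {precision_score(y_eval, model.predict(X_eval)):.3f}")
```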

That was my first crack at synthetic data. I will keep investigating techniques to get better quality in the output.

Cheers