In the course you mentioned that the distortions we apply when augmenting data should be of the type we see in the test set.
Doesn't this cause data leakage, since we are modifying data to look like the test data?
There are two situations we want to avoid:
- Our label y appears in our features X
- Our test data appears in our training data
The first situation is called data leakage. The second situation leads to an over-estimation of our model's performance on supposedly unseen data.
Data augmentation does not add label information to our features, and it does not copy test data into the training data (or vice versa). It simply synthesizes new data without doing any of the above.
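To make the ordering concrete, here is a minimal sketch of where augmentation sits in the pipeline (assuming NumPy image arrays and scikit-learn's train_test_split; the shapes and variable names are illustrative, not from the course): we split first, then synthesize new samples from the training split only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 100 images of shape (H, W, C) with binary labels.
X = np.random.rand(100, 32, 32, 3)
y = np.random.randint(0, 2, size=100)

# Split first: from here on, the test set is fixed and never touched again.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Augment ONLY the training split, e.g. by mirroring each image left-right.
# Nothing is copied from or to the test set, and the labels are never
# embedded into the features; we only synthesize new training samples.
X_flipped = np.flip(X_train, axis=2)                  # flip along the width axis
X_train = np.concatenate([X_train, X_flipped], axis=0)
y_train = np.concatenate([y_train, y_train], axis=0)  # flipped copies keep their labels
```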
Lastly, it is DESIRED for our training data to have a similar distribution to the test data.
Consider that you want your model to recognize cats and dogs; then, very naturally, your test dataset will be full of photos of cats and dogs, right?
Now ask yourself this: would you train that model on a dataset of cups, chairs, and tables just to prevent it from seeing any photos that look like the test dataset? The answer is obviously no. On the contrary, we want to make sure the training data looks like what we want to test the trained model on.
If your test dataset has upside-down cats and upside-down dogs, you are telling me that the model is supposed to be able to recognize upside-down cats and upside-down dogs. Then I will make sure my training dataset has upside-down cats and upside-down dogs, and if there are none, I will augment some data by flipping some of my existing photos upside down.
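As a hedged sketch of that last step (NumPy assumed; the variable names are made up for illustration), an "upside-down" variant is just a flip along the height axis, and the flipped photo keeps its original label:

```python
import numpy as np

# Illustrative training images, shape (N, H, W, C), with labels 0 = cat, 1 = dog.
train_images = np.random.rand(10, 32, 32, 3)
train_labels = np.array([0, 1] * 5)

# Flip along the height axis to synthesize upside-down cats and dogs.
# An upside-down cat is still a cat, so each flipped image keeps its label.
upside_down = np.flip(train_images, axis=1)

train_images = np.concatenate([train_images, upside_down], axis=0)
train_labels = np.concatenate([train_labels, train_labels], axis=0)
```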
Giving photos of cats and dogs to train the model does NOT cause data leakage. Embedding the ground-truth label into the photo causes data leakage.
Cheers,
Raymond