In the course you mentioned that the distortions we apply when augmenting data should be of the type we see in the test set.
Doesn't this cause data leakage, since we are modifying data to look like the test data?
There are two situations we want to avoid:
- Our label y appears in our features X
- Our test data appears in our training data
The first situation is called data leakage. The second situation leads to an over-estimation of our model's performance on supposedly unseen data.
Data augmentation does not add label information to our features, and it does not copy test data into the training data (or vice versa). It simply synthesizes new data without doing any of the above.
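To make the ordering concrete, here is a minimal sketch of where augmentation sits in the pipeline (assuming NumPy image arrays and scikit-learn's train_test_split; the shapes and variable names are illustrative, not from the course): we split first, then synthesize new samples from the training split only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 100 images of shape (H, W, C) with binary labels.
X = np.random.rand(100, 32, 32, 3)
y = np.random.randint(0, 2, size=100)

# Split first: from here on, the test set is fixed and never touched again.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Augment ONLY the training split, e.g. by mirroring each image left-right.
# Nothing is copied from or to the test set, and the labels are never
# embedded into the features; we only synthesize new training samples.
X_flipped = np.flip(X_train, axis=2)                  # flip along the width axis
X_train = np.concatenate([X_train, X_flipped], axis=0)
y_train = np.concatenate([y_train, y_train], axis=0)  # flipped copies keep their labels
```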
Lastly, it is DESIRED for our training data to have a similar distribution to the test data.
Consider that you want your model to recognize cats and dogs; then, very naturally, your test dataset will be full of photos of cats and dogs, right?
Now ask yourself this: would you train that model on a dataset of cups, chairs, and tables just to prevent it from seeing any photos that look like the test dataset? The answer is obviously no. On the contrary, we want to make sure the training data looks like what we want to test the trained model on.
If your test dataset has upside-down cats and upside-down dogs, you are telling me that the model is supposed to be able to recognize upside-down cats and upside-down dogs. Then I will make sure my training dataset has upside-down cats and upside-down dogs, and if there are none, I will augment some data by flipping some of my existing photos upside down.
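As a hedged sketch of that last step (NumPy assumed; the variable names are made up for illustration), an "upside-down" variant is just a flip along the height axis, and the flipped photo keeps its original label:

```python
import numpy as np

# Illustrative training images, shape (N, H, W, C), with labels 0 = cat, 1 = dog.
train_images = np.random.rand(10, 32, 32, 3)
train_labels = np.array([0, 1] * 5)

# Flip along the height axis to synthesize upside-down cats and dogs.
# An upside-down cat is still a cat, so each flipped image keeps its label.
upside_down = np.flip(train_images, axis=1)

train_images = np.concatenate([train_images, upside_down], axis=0)
train_labels = np.concatenate([train_labels, train_labels], axis=0)
```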
Giving photos of cats and dogs to train the model does NOT cause data leakage. Embedding the ground-truth label into the photo causes data leakage.
Cheers,
Raymond