C2_W3_Video: Adding Data

Hello, I believe it was mentioned in the lecture that the data created through data augmentation should be similar to the test set.

However, as far as I know, we should never look at the test set before we finish training our model. Yet to create data that is similar to the test set, we need to look at it, right?

Can anyone help me understand this further?

Thanks.

2 Likes

Saying that you “don’t look at the test set” just means that your trained model is not evaluated on the test set until you finish training and are ready to perform the final evaluation of how well it performs. It’s fine for you to look at the test data earlier in the process to understand its properties, so that you can figure out the right strategy for data augmentation. If you are doing data augmentation, that means either you haven’t run your training yet, or you ran it and it didn’t work well enough on the cross-validation data, so you need more training data.

2 Likes

Thank you for the reply! However, I would like to ask: if you perform data augmentation based on the test set and then train your model on it, since your model has learned from data similar to the test set, wouldn’t that end up in an overly optimistic test result?

1 Like

Yes, that might be a concern, but the point is that augmented data is not identical to the base data you start from to do the augmentation: it’s just similar in some way, depending on what kind of augmentation techniques you use.

If you are worried about that effect, you could do the augmentation on your full data set and then “redivide” all the data into the subsets randomly, so that the augmented data is fairly distributed among all three sets (train, cross validation/dev and test).

Also note that you can make it fair by starting your training over from scratch, not from the weights previously learned. The point is that you’re adding at least part of the augmented data to the training set, right? If you are just resetting everything and starting over, then the model has not “seen” any of the data before from a logical perspective. You’ve seen it, but the real point is to make the training fair.
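If it helps, here is a minimal sketch of that “augment everything, then re-divide” idea, assuming the examples live in NumPy arrays and that `augment` is some label-preserving transform you supply (the function name and split fractions are just illustrative, not from the course):

```python
import numpy as np

def resplit_after_augmentation(X, y, augment, train_frac=0.6, dev_frac=0.2, seed=0):
    """Augment the full labeled set, then randomly re-divide everything into
    train / dev (cross-validation) / test, so the augmented examples end up
    fairly distributed across all three subsets."""
    # One augmented copy of every example; `augment` is any label-preserving
    # transform you choose (flip, rotation, added noise, ...).
    X_aug = np.stack([augment(x) for x in X])
    X_all = np.concatenate([X, X_aug])
    y_all = np.concatenate([y, y])

    # Shuffle, then cut into the three subsets.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X_all))
    n_train = int(train_frac * len(idx))
    n_dev = int(dev_frac * len(idx))
    train_idx, dev_idx, test_idx = np.split(idx, [n_train, n_train + n_dev])
    return ((X_all[train_idx], y_all[train_idx]),
            (X_all[dev_idx], y_all[dev_idx]),
            (X_all[test_idx], y_all[test_idx]))
```

Training from scratch on the new training split then keeps the comparison fair, as described above.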

1 Like

What exactly does “based on the test set” mean?

You’re never going to use the test set for training - only as a final check of how well your completed system works.

1 Like

Hello @jaejun02,

[edit: I was quoting DLS lectures in this reply. For free access to them, check this reply below]

I don’t think collecting or augmenting new data with reference to the test set is really an issue.

In the upcoming Course 3 (for example, the “Addressing Data Mismatch” video), we are going to see that if the test data has a different distribution from the training data, we can address that mismatch.

I would also like to mention “production data”, which is the data seen by the model at production time.

We want our train/dev/test data to all match the production data, or at least the dev/test data. The key is that to match, we need to look. In fact, the C3 W2 “Error Analysis” video gives examples of what to look for.

Cheers,
Raymond

2 Likes

Hello, thanks for the reply!

After reading your reply and the others’, I still kind of think that referring to the test set and augmenting your data so that it is similar to the test set poses a contamination problem. Rather, wouldn’t it be better to refer to the dev set when augmenting and leave the test set untouched, so that it gives a more accurate (non-optimistic) generalization score? Because if you train your model on data that is similar to the test set, then your model naturally gets to perform well on the test set, right?

Or does this statement from the lecture, “But if to the extent that this isn’t that representative of what you see in the test set because you don’t often get images like this in the test set is actually going to be less helpful.”, mean that you just have to apply a data augmentation technique that seems realistic (that is, one that appears in the real world), not just a random augmentation technique?

Thanks!

1 Like

Augmenting the data set is a low-cost way to make your labeled data set larger. For images, such methods as translation, rotation, and mirroring can allow the model to generalize better and give more robust results, without the cost of collecting a huge amount of additional examples to capture that variance.
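For concreteness, here is a rough sketch of those kinds of image transforms in NumPy, assuming an image is an H x W (x C) array; the helper names are just illustrative, not from the course:

```python
import numpy as np

def mirror(img):
    # Left-right flip (horizontal mirroring).
    return img[:, ::-1]

def translate(img, dx, dy):
    # Shift by (dx, dy) pixels; np.roll wraps pixels around the border, which
    # is a crude stand-in for a real translation with padding.
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

def rotate(img, k=1):
    # Rotate by k * 90 degrees (arbitrary angles would need something like
    # scipy.ndimage.rotate).
    return np.rot90(img, k)
```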

Those sorts of augmentation generally do not create false data in the test set. So in some cases you might augment the entire data set before splitting it into training/validation/test sets.

But first you have to be careful that the augmentation you use for a given problem is appropriate for the test set.

If it’s impossible for a given augmentation to ever occur in the completed system (which the test set represents), then you have to question whether that augmentation would be appropriate for the training set either.

2 Likes

Hello @jaejun02,

Let me share how I think about this train/dev/test thing, then I will go into your questions.

A machine learning model is task-specific. Given that the test set should reflect the real-world data (the task), we can say that the ML model is specific to the task defined by the test set; therefore, pleasing the test set should be the primary objective and not a problem.

However, during training, we should keep the model off the test set because a high-variance model can “remember” (overfit to) training samples.

To emphasize: the cause is “the model can overfit to samples”; the effect is “we don’t train with the test samples”.

The line here is “keep the test samples out of the training process”. Does augmenting new training samples with reference to the test set cross the line? No, because no augmented sample is a replicate of any test sample. In other words, even if the model overfits to the augmented samples, it does not remember any test sample at all and thus won’t gain an unfair benefit in the test score.

Given the cause and effect and the line drawn, how can we please the test set? We prepare either a training or a dev set that shares the same distribution of data as the test set, or we prepare both. Either way, we make sure no sample is shared between the test set and the train/dev set, so no “remembering” of test samples by the model can occur.
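As a small illustration of that “no shared sample” condition, here is a hypothetical sanity check (assuming the examples are NumPy arrays; the function is just a sketch, not from the course):

```python
import numpy as np

def assert_no_overlap(X_train, X_dev, X_test):
    """Raise if any test example also appears, byte-for-byte, in the
    train or dev set."""
    def hashes(X):
        return {np.ascontiguousarray(x).tobytes() for x in X}

    test_hashes = hashes(X_test)
    assert not (test_hashes & hashes(X_train)), "test sample found in train set"
    assert not (test_hashes & hashes(X_dev)), "test sample found in dev set"
```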

Now, your questions:

If the dev set has the same distribution as the test set, then referring to either the dev set or the test set makes no difference.

Even if we refer to the test set, we don’t copy test samples into the training/dev set, and in this sense the test set is still untouched. (We just differ in what “touch” means.)

I recommend this lecture, which says the dev and test sets should come from the same distribution.

We can find many statements similar to this one:

[quoted image: a definition of generalization emphasizing evaluation on “unseen data”]

Here the keyword is “unseen data”. So our line is drawn correctly: “keep the test samples out of the training process”, because we can make sure the generalization score is evaluated with unseen data.

That’s true, but this is also how we please the test set, which is the purpose of the model. ML is task-specific, and it is specific to the task defined by the test set. Therefore, it only makes sense for us to refer to the test set in some way, right? Can you imagine knowing nothing about the test data (the real-world data)?

You said appearing in the real world, and I say appearing in the test set. However, the test set should reflect the real world, so, by extension, I agree with you!

Cheers,
Raymond

2 Likes

By the way, @jaejun02, I just realized that you were asking questions about the MLS but, in my last reply, I was quoting lectures from the DLS.

So, I would like to share two ways you can have free access to them:

Even though they are DLS course 3 lectures, they don’t really require knowledge from courses 1 & 2, so you might choose to watch them as supplementary material for your MLS journey.

Cheers,
Raymond

2 Likes