C2_W3_Video: Adding Data

Hello, I believe it was mentioned in the lecture that the data created through data augmentation should be similar to the test set.

However, as far as I know, we should never look at the test set before we finish training our model. Yet to create data that is similar to the test set, we need to look at it, right?

Can anyone help me understand this further?

Thanks.

2 Likes

Saying that you “don’t look at the test set” just means that your trained model is not evaluated on the test set until you finish training and are ready to perform the final evaluation of how well it performs. It’s fine for you to look at the test data earlier in the process to understand its properties, so that you can figure out the right strategy for data augmentation. If you are doing data augmentation, that means either you haven’t run your training yet, or you ran it and it didn’t work well enough on the cross-validation data, so you need more training data.

2 Likes

Thank you for the reply! However, I would like to ask: if you perform data augmentation based on the test set and then train your model on it, since your model has learned from data similar to the test set, wouldn’t that end up in an overly optimistic test result?

1 Like

Yes, that might be a concern, but the point is that augmented data is not identical to the base data you start from to do the augmentation: it’s just similar in some way, depending on what kind of augmentation techniques you use.

If you are worried about that effect, you could do the augmentation on your full data set and then “redivide” all the data into the subsets randomly, so that the augmented data is fairly distributed among all three sets (train, cross validation/dev and test).

Also note that you can make it fair by starting your training over from scratch, not from the weights previously learned. The point is that you’re adding at least part of the augmented data to the training set, right? If you are just resetting everything and starting over, then the model has not “seen” any of the data before from a logical perspective. You’ve seen it, but the real point is to make the training fair.
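If it helps, here is a minimal sketch of that “augment everything, then re-divide” idea, assuming the examples live in NumPy arrays and that `augment` is some label-preserving transform you supply (the function name and split fractions are just illustrative, not from the course):

```python
import numpy as np

def resplit_after_augmentation(X, y, augment, train_frac=0.6, dev_frac=0.2, seed=0):
    """Augment the full labeled set, then randomly re-divide everything into
    train / dev (cross-validation) / test, so the augmented examples end up
    fairly distributed across all three subsets."""
    # One augmented copy of every example; `augment` is any label-preserving
    # transform you choose (flip, rotation, added noise, ...).
    X_aug = np.stack([augment(x) for x in X])
    X_all = np.concatenate([X, X_aug])
    y_all = np.concatenate([y, y])

    # Shuffle, then cut into the three subsets.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X_all))
    n_train = int(train_frac * len(idx))
    n_dev = int(dev_frac * len(idx))
    train_idx, dev_idx, test_idx = np.split(idx, [n_train, n_train + n_dev])
    return ((X_all[train_idx], y_all[train_idx]),
            (X_all[dev_idx], y_all[dev_idx]),
            (X_all[test_idx], y_all[test_idx]))
```

Training from scratch on the new training split then keeps the comparison fair, as described above.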

1 Like

What exactly does “based on the test set” mean?

You’re never going to use the test set for training - only as a final check of how well your completed system works.

1 Like

Hello @jaejun02,

[edit: I was quoting DLS lectures in this reply. For free access to them, check this reply below]

I don’t think collecting or augmenting new data with reference to the test set is really an issue.

In the upcoming Course 3 (for example, the “Addressing Data Mismatch” video), we are going to see that if the test data has a different distribution from the training data, we can address that mismatch.

I would also like to mention “production data”, which is the data seen by the model at production time.

We want our train/dev/test data to all match the production data, or at least the dev/test data. The key is that to match, we need to look. In fact, the C3 W2 “Error Analysis” video gives examples of what to look for.

Cheers,
Raymond

2 Likes

Hello, thanks for the reply!

After reading your reply and the others’, I still kind of think that referring to the test set and augmenting your data so that it is similar to the test set poses a contamination problem. Rather, wouldn’t it be better to refer to the dev set when augmenting and leave the test set untouched, so that it gives a more accurate (non-optimistic) generalization score? Because if you train your model on data that is similar to the test set, then your model naturally gets to perform well on the test set, right?

Or does this statement from the lecture, “But if to the extent that this isn’t that representative of what you see in the test set because you don’t often get images like this in the test set is actually going to be less helpful.”, mean that you just have to apply a data augmentation technique that seems realistic (that is, one that appears in the real world), not just a random augmentation technique?

Thanks!

1 Like

Augmenting the data set is a low-cost way to make your labeled data set larger. For images, such methods as translation, rotation, and mirroring can allow the model to generalize better and give more robust results, without the cost of collecting a huge amount of additional examples to capture that variance.
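For concreteness, here is a rough sketch of those kinds of image transforms in NumPy, assuming an image is an H x W (x C) array; the helper names are just illustrative, not from the course:

```python
import numpy as np

def mirror(img):
    # Left-right flip (horizontal mirroring).
    return img[:, ::-1]

def translate(img, dx, dy):
    # Shift by (dx, dy) pixels; np.roll wraps pixels around the border, which
    # is a crude stand-in for a real translation with padding.
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

def rotate(img, k=1):
    # Rotate by k * 90 degrees (arbitrary angles would need something like
    # scipy.ndimage.rotate).
    return np.rot90(img, k)
```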

Those sorts of augmentation generally do not create false data in the test set. So in some cases you might augment the entire data set before splitting it into training/validation/test sets.

But first you have to be careful that the augmentation you use for a given problem is appropriate for the test set.

If it’s impossible for a given augmentation to ever occur in the completed system (which the test set represents), then you have to question whether that augmentation would be appropriate for the training set either.

2 Likes

Hello @jaejun02,

Let me share how I think about this train/dev/test thing, then I will go into your questions.

A machine learning model is task-specific. Given that the test set should reflect the real-world data (the task), we can say that the ML model is specific to the task defined by the test set; therefore, pleasing the test set should be the primary objective and not a problem.

However, during training, we should keep the model off the test set because a high-variance model can “remember” (overfit to) training samples.

To emphasize: the cause is “the model can overfit to samples”; the effect is “we don’t train with the test samples”.

The line here is “keep the test samples out of the training process”. Does augmenting new training samples with reference to the test set cross the line? No, because no augmented sample is a replicate of any test sample. In other words, even if the model overfits to the augmented samples, it does not remember any test sample at all and thus won’t gain an unfair benefit in the test score.

Given the cause and effect and the line drawn, how can we please the test set? We prepare either a training or a dev set that shares the same distribution of data as the test set, or we prepare both. Either way, we make sure no sample is shared between the test set and the train/dev set, so no “remembering” of test samples by the model can occur.
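As a small illustration of that “no shared sample” condition, here is a hypothetical sanity check (assuming the examples are NumPy arrays; the function is just a sketch, not from the course):

```python
import numpy as np

def assert_no_overlap(X_train, X_dev, X_test):
    """Raise if any test example also appears, byte-for-byte, in the
    train or dev set."""
    def hashes(X):
        return {np.ascontiguousarray(x).tobytes() for x in X}

    test_hashes = hashes(X_test)
    assert not (test_hashes & hashes(X_train)), "test sample found in train set"
    assert not (test_hashes & hashes(X_dev)), "test sample found in dev set"
```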

Now, your questions:

If the dev set has the same distribution as the test set, then referring to either the dev set or the test set makes no difference.

Even if we refer to the test set, we don’t copy test samples into the training/dev set, and in this sense the test set is still untouched. (We just differ in what “touch” means.)

I recommend this lecture, which says the dev and test sets should come from the same distribution.

We can find many statements similar to this one:

[quoted image: a definition of generalization emphasizing evaluation on “unseen data”]

Here the keyword is “unseen data”. So our line is drawn correctly: “keep the test samples out of the training process”, because we can make sure the generalization score is evaluated with unseen data.

That’s true, but this is also how we please the test set, which is the purpose of the model. ML is task-specific, and it is specific to the task defined by the test set. Therefore, it only makes sense for us to refer to the test set in some way, right? Can you imagine knowing nothing about the test data (the real-world data)?

You said appearing in the real world, and I say appearing in the test set. However, the test set should reflect the real world, so, by extension, I agree with you!

Cheers,
Raymond

2 Likes

By the way, @jaejun02, I just realized that you were asking questions about the MLS but, in my last reply, I was quoting lectures from the DLS.

So, I would like to share two ways you can have free access to them:

Even though they are DLS course 3 lectures, they don’t really require knowledge from courses 1 & 2, so you might choose to watch them as supplementary material for your MLS journey.

Cheers,
Raymond

2 Likes