Option 3: adjust quality of train/dev/test set to match quality of prediction set pictures


In the Cat App example (with 205,000/2,500/2,500 initial train/dev/test sizes), where the cat/not-a-cat prediction is made on pictures submitted by users, which are generally of worse quality than the pictures in the training set, the argument is made for Option 2, i.e. that if we have 10,000 pictures submitted by users, the best use of these pictures is to split them between the train/dev/test sets in 5,000/2,500/2,500 proportion.

What about Option 3: modify the original train/dev/test pictures so their quality matches that of the pictures submitted by users? Does this idea have any merit?

The argument for it is that, after such a transformation, the train/dev/test distribution would much better match the distribution of the prediction set (pictures submitted by users). Of course, to make it work, we would need some robust way of measuring the “quality” of the pictures, etc.

Right, so how would you do that? You would need a transformation function that you can use to implement your “make the quality more congruent” algorithm. If you are starting with 200k images before you add the 10k, then you obviously need an automated way to implement whatever it is that you specify there.
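For concreteness, here is a minimal sketch of what such a transformation could look like, assuming Pillow and NumPy are available. The `sharpness` and `degrade` helpers are hypothetical, and “downscale plus low-quality JPEG re-encoding” is just one plausible proxy for user-submitted picture quality, not a method from the course:

```python
import io

import numpy as np
from PIL import Image, ImageFilter


def sharpness(img):
    # Crude quality proxy: variance of an edge-filtered greyscale image.
    edges = img.convert("L").filter(ImageFilter.FIND_EDGES)
    return float(np.asarray(edges, dtype=np.float32).var())


def degrade(img, scale=0.5, jpeg_quality=30):
    # Downscale and upscale back to simulate low resolution, then re-encode
    # as a low-quality JPEG to add the compression artifacts typical of
    # user-submitted pictures.
    w, h = img.size
    small = img.resize((max(1, int(w * scale)), max(1, int(h * scale))))
    restored = small.resize((w, h))
    buf = io.BytesIO()
    restored.convert("RGB").save(buf, format="JPEG", quality=jpeg_quality)
    buf.seek(0)
    return Image.open(buf)


# Example: compare the quality score of a web-crawled image before and after
# degradation (file name is made up for illustration).
# web_img = Image.open("web_cat_0001.png")
# print(sharpness(web_img), sharpness(degrade(web_img)))
```

Even with something like this, the hard part is choosing the degradation parameters so that the two distributions actually line up, which is exactly the difficulty raised above.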

As a starting point, recall that a machine learning model is trained on images of a particular format, meaning that any images that are used have already been converted to the same pixel size and image type (greyscale, RGB, CMYK …).
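As an illustration of that preprocessing step (a sketch, assuming Pillow; the 64×64 input size is only an assumption, not something fixed by the course):

```python
from PIL import Image


def to_model_format(path, size=(64, 64)):
    # Force a single colour mode (RGB) and a fixed pixel size, regardless of
    # whether the source file was greyscale, CMYK, or a different resolution.
    return Image.open(path).convert("RGB").resize(size)
```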

But perhaps the simpler approach is to say that if your training dataset is from the wrong distribution, you need to get a better training set. The problem is that this is not always an inexpensive thing to do.

Actually, in one of the later videos (Week 2 > Addressing Data Mismatch) something quite similar is suggested: adding car noise to “clean” recorded audio to make the training data more congruent with the dev/test data, which typically includes car noise. Maybe, in the “cat app” example, the most practical approach would be to start with Option 2 and, as more and more pictures are submitted by users over time (assuming these are or can be labeled), keep adding them to the training set. This way, the gap between the training and dev/test distributions will continue to shrink.
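For reference, the kind of synthesis described in that video could look roughly like this (a NumPy sketch; `add_car_noise` and the target-SNR parameter are my own assumptions, not the course's code):

```python
import numpy as np


def add_car_noise(clean, noise, snr_db=10.0, rng=None):
    # Mix a random segment of a (longer) car-noise recording into a clean
    # waveform at a chosen signal-to-noise ratio, producing "in-car" style
    # training audio. Assumes len(noise) >= len(clean).
    rng = np.random.default_rng() if rng is None else rng
    start = rng.integers(0, len(noise) - len(clean) + 1)
    segment = noise[start:start + len(clean)]
    p_signal = np.mean(clean ** 2)
    p_noise = np.mean(segment ** 2) + 1e-12
    # Scale so that 10*log10(p_signal / (scale**2 * p_noise)) == snr_db.
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * segment
```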