Option 3: adjust quality of train/dev/test set to match quality of prediction set pictures


In the Cat App example (with 205,000/2,500/2,500 initial train/dev/test sizes), where the cat/not-a-cat prediction is made on pictures submitted by users, which are generally of worse quality than the pictures in the training set, the argument is made for Option 2, i.e. that if we have 10,000 pictures submitted by users, the best use of these pictures is to split them between the train/dev/test sets in 5,000/2,500/2,500 proportion.

What about Option 3: modify the original train/dev/test pictures so their quality matches that of the pictures submitted by users? Does this idea have any merit?

The argument for it is that, after such a transformation, the train/dev/test distribution would much better match the distribution of the prediction set (pictures submitted by users). Of course, to make it work, we would need some robust way of measuring the “quality” of the pictures, etc.

Right, so how would you do that? You would need a transformation function that you can use to implement your “make the quality more congruent” algorithm. If you are starting with 200k images before you add the 10k, then you obviously need an automated way to implement whatever it is that you specify there.
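For concreteness, here is a minimal sketch of what such a transformation could look like, assuming Pillow and NumPy are available. The `sharpness` and `degrade` helpers are hypothetical, and “downscale plus low-quality JPEG re-encoding” is just one plausible proxy for user-submitted picture quality, not a method from the course:

```python
import io

import numpy as np
from PIL import Image, ImageFilter


def sharpness(img):
    # Crude quality proxy: variance of an edge-filtered greyscale image.
    edges = img.convert("L").filter(ImageFilter.FIND_EDGES)
    return float(np.asarray(edges, dtype=np.float32).var())


def degrade(img, scale=0.5, jpeg_quality=30):
    # Downscale and upscale back to simulate low resolution, then re-encode
    # as a low-quality JPEG to add the compression artifacts typical of
    # user-submitted pictures.
    w, h = img.size
    small = img.resize((max(1, int(w * scale)), max(1, int(h * scale))))
    restored = small.resize((w, h))
    buf = io.BytesIO()
    restored.convert("RGB").save(buf, format="JPEG", quality=jpeg_quality)
    buf.seek(0)
    return Image.open(buf)


# Example: compare the quality score of a web-crawled image before and after
# degradation (file name is made up for illustration).
# web_img = Image.open("web_cat_0001.png")
# print(sharpness(web_img), sharpness(degrade(web_img)))
```

Even with something like this, the hard part is choosing the degradation parameters so that the two distributions actually line up, which is exactly the difficulty raised above.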

As a starting point, recall that a machine learning model is trained on images of a particular format, meaning that any images that are used have already been converted to the same pixel size and image type (greyscale, RGB, CMYK …).
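As an illustration of that preprocessing step (a sketch, assuming Pillow; the 64×64 input size is only an assumption, not something fixed by the course):

```python
from PIL import Image


def to_model_format(path, size=(64, 64)):
    # Force a single colour mode (RGB) and a fixed pixel size, regardless of
    # whether the source file was greyscale, CMYK, or a different resolution.
    return Image.open(path).convert("RGB").resize(size)
```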

But perhaps the simpler approach is to say that if your training dataset is from the wrong distribution, you need to get a better training set. The problem is that this is not always an inexpensive thing to do.

Actually, in one of the later videos (Week 2 > Addressing Data Mismatch) something quite similar is suggested: adding car noise to “clean” recorded audio to make the training data more congruent with the dev/test data, which typically includes car noise. Maybe, in the “cat app” example, the most practical approach would be to start with Option 2 and, as more and more pictures are submitted by users over time (assuming these are or can be labeled), keep adding them to the training set. This way, the gap between the training and dev/test distributions will continue to shrink.
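For reference, the kind of synthesis described in that video could look roughly like this (a NumPy sketch; `add_car_noise` and the target-SNR parameter are my own assumptions, not the course's code):

```python
import numpy as np


def add_car_noise(clean, noise, snr_db=10.0, rng=None):
    # Mix a random segment of a (longer) car-noise recording into a clean
    # waveform at a chosen signal-to-noise ratio, producing "in-car" style
    # training audio. Assumes len(noise) >= len(clean).
    rng = np.random.default_rng() if rng is None else rng
    start = rng.integers(0, len(noise) - len(clean) + 1)
    segment = noise[start:start + len(clean)]
    p_signal = np.mean(clean ** 2)
    p_noise = np.mean(segment ** 2) + 1e-12
    # Scale so that 10*log10(p_signal / (scale**2 * p_noise)) == snr_db.
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * segment
```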