Data set distribution problem

Hello! I have run into a question in my hands-on project about how to distribute datasets for a deep neural network.

Following the video in the DLS course, I am tackling a problem that requires analysing data drawn from different subsets (or sources) of the same kind of data (similar to the example of web-page cats (large set) and consumer-camera cats (small set)). Is it sensible for me to do either of the following?

Step 1. Add some consumer-camera cat examples to my training set, which is primarily web-page cats
– so that the NN can learn from a bigger dataset and achieve higher accuracy

Step 2. Use a higher percentage of consumer-camera cats in my validation and test sets (rough sketch below)
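
If it helps, here is a minimal sketch of what I mean by option 1, assuming the two sources are already loaded as NumPy arrays; `web_images` and `consumer_images` are just hypothetical placeholders:

```python
import numpy as np

# Hypothetical placeholder arrays standing in for the real data:
# a large web-page cat set and a small consumer-camera cat set.
web_images = np.zeros((2000, 32, 32, 3), dtype=np.uint8)
consumer_images = np.zeros((200, 32, 32, 3), dtype=np.uint8)

rng = np.random.default_rng(0)
shuffled = rng.permutation(len(consumer_images))

# Send roughly half of the consumer examples to training; keep the rest
# for dev/test so those sets still reflect the distribution I care about.
half = len(shuffled) // 2
train_consumer, rest = shuffled[:half], shuffled[half:]
dev_idx, test_idx = rest[:len(rest) // 2], rest[len(rest) // 2:]

train_set = np.concatenate([web_images, consumer_images[train_consumer]])
dev_set = consumer_images[dev_idx]
test_set = consumer_images[test_idx]
print(train_set.shape, dev_set.shape, test_set.shape)
```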

Or:
Step 1. Pool the consumer-camera cat examples with all of the web-page data, then split into train/dev/test sets randomly at a reasonable ratio (sketch below)?
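
And a corresponding sketch of option 2, using the same hypothetical placeholder arrays:

```python
import numpy as np

# Same hypothetical placeholder arrays as above.
web_images = np.zeros((2000, 32, 32, 3), dtype=np.uint8)
consumer_images = np.zeros((200, 32, 32, 3), dtype=np.uint8)

# Pool both sources, shuffle, then split at a fixed ratio (e.g. 90/5/5).
pooled = np.concatenate([web_images, consumer_images])
idx = np.random.default_rng(0).permutation(len(pooled))

n_train = int(0.90 * len(pooled))
n_dev = int(0.05 * len(pooled))
train_set = pooled[idx[:n_train]]
dev_set = pooled[idx[n_train:n_train + n_dev]]
test_set = pooled[idx[n_train + n_dev:]]
print(train_set.shape, dev_set.shape, test_set.shape)
```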

Personally, I have found the second approach gives better performance on the test set, but I want to make sure that it is logically sound.

Thank you in advance,
Yuhan Chiang

Hi @Chiang_Yuhan
I think both methods have pros and cons. If you use data from the web only for training your model, it may not generalize well because of similar/duplicate images. And if you use web images for both training and testing, you might get inflated accuracy because similar/duplicate images appear in both sets, which leads to overestimating your model's real performance. You can also use other techniques such as data augmentation to increase the diversity of your training dataset without combining it with other data.
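
For example, a minimal augmentation sketch built from standard Keras preprocessing layers; the specific layers and strengths here are only illustrative, not a recommendation:

```python
import tensorflow as tf

# A small augmentation pipeline; each layer applies a random transform.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
    tf.keras.layers.RandomContrast(0.2),
])

# Apply to a (hypothetical) batch of web-page cat images during training only.
web_batch = tf.zeros((32, 64, 64, 3))
augmented_batch = augment(web_batch, training=True)
print(augmented_batch.shape)
```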
