Data set distribution problem

Hello! I have run into a question in my hands-on project about how to distribute datasets for a deep neural network.

Following the video in the DLS course, I am tackling a problem that requires analysing data drawn from different subsets (or sources) of the same kind of data (similar to the example of web-page cats (large set) and consumer-camera cats (small set)). Is it sensible for me to do either of the following?

Step 1. Add some consumer-camera cat examples to my training set, which is primarily web-page cats
– so that the NN can learn from a bigger dataset and achieve higher accuracy

Step 2. Use a higher percentage of consumer-camera cats in my validation and test sets (rough sketch below)
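
If it helps, here is a minimal sketch of what I mean by option 1, assuming the two sources are already loaded as NumPy arrays; `web_images` and `consumer_images` are just hypothetical placeholders:

```python
import numpy as np

# Hypothetical placeholder arrays standing in for the real data:
# a large web-page cat set and a small consumer-camera cat set.
web_images = np.zeros((2000, 32, 32, 3), dtype=np.uint8)
consumer_images = np.zeros((200, 32, 32, 3), dtype=np.uint8)

rng = np.random.default_rng(0)
shuffled = rng.permutation(len(consumer_images))

# Send roughly half of the consumer examples to training; keep the rest
# for dev/test so those sets still reflect the distribution I care about.
half = len(shuffled) // 2
train_consumer, rest = shuffled[:half], shuffled[half:]
dev_idx, test_idx = rest[:len(rest) // 2], rest[len(rest) // 2:]

train_set = np.concatenate([web_images, consumer_images[train_consumer]])
dev_set = consumer_images[dev_idx]
test_set = consumer_images[test_idx]
print(train_set.shape, dev_set.shape, test_set.shape)
```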

Or:
Step 1. Pool the consumer-camera cat examples with all of the web-page data, then split into train/dev/test sets randomly at a reasonable ratio (sketch below)?
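
And a corresponding sketch of option 2, using the same hypothetical placeholder arrays:

```python
import numpy as np

# Same hypothetical placeholder arrays as above.
web_images = np.zeros((2000, 32, 32, 3), dtype=np.uint8)
consumer_images = np.zeros((200, 32, 32, 3), dtype=np.uint8)

# Pool both sources, shuffle, then split at a fixed ratio (e.g. 90/5/5).
pooled = np.concatenate([web_images, consumer_images])
idx = np.random.default_rng(0).permutation(len(pooled))

n_train = int(0.90 * len(pooled))
n_dev = int(0.05 * len(pooled))
train_set = pooled[idx[:n_train]]
dev_set = pooled[idx[n_train:n_train + n_dev]]
test_set = pooled[idx[n_train + n_dev:]]
print(train_set.shape, dev_set.shape, test_set.shape)
```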

Personally, I have found the second approach gives better performance on the test set, but I want to make sure that it is logically sound.

Thank you in advance,
Yuhan Chiang

Hi @Chiang_Yuhan
I think both methods have pros and cons. If you use data from the web only for training your model, it may not generalize well because of similar/duplicate images. And if you use web images for both training and testing, you might get inflated accuracy because similar/duplicate images appear in both sets, which leads to overestimating your model's real performance. You can also use other techniques such as data augmentation to increase the diversity of your training dataset without combining it with other data.
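
For example, a minimal augmentation sketch built from standard Keras preprocessing layers; the specific layers and strengths here are only illustrative, not a recommendation:

```python
import tensorflow as tf

# A small augmentation pipeline; each layer applies a random transform.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
    tf.keras.layers.RandomContrast(0.2),
])

# Apply to a (hypothetical) batch of web-page cat images during training only.
web_batch = tf.zeros((32, 64, 64, 3))
augmented_batch = augment(web_batch, training=True)
print(augmented_batch.shape)
```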
