C3_W2: Distribution of images across different labels in training set

AvijeetPrasad · January 17, 2024, 6:45am

Let’s say I have 10,000 images that I want to use to train a cat classifier. Of those, 2,000 are cat images, and the other 8,000 have a similar background but no cat.

I was wondering if there is any issue with having more ‘negative’ examples than ‘positive’ ones. Are there any guidelines on how images should be distributed across the labels in the training set?

TMosh · January 17, 2024, 6:57am

Generally you would like to have roughly equal numbers of ‘true’ and ‘false’ cases, but some amount of skew is acceptable. Having at least about 10% ‘true’ examples seems to be a good working limit.

It depends somewhat on the total number of examples you have for training.
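
As a rough illustration of the numbers in the question, here is a minimal Python sketch that checks the label distribution against the 10% working limit mentioned above. The helper names (`label_distribution`, `balanced_class_weights`) and the 1 = cat / 0 = no cat encoding are assumptions made for this example. The inverse-frequency class weights are one common heuristic for handling skew, not something proposed in this thread, so treat that part as optional.

```python
from collections import Counter

def label_distribution(labels):
    """Return the fraction of examples per label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

def balanced_class_weights(labels):
    """Inverse-frequency weights: total / (n_classes * count).

    This is an assumed heuristic for illustration; rarer labels
    receive proportionally larger weights.
    """
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}

# Hypothetical encoding: 1 = cat, 0 = no cat (numbers taken from the question).
labels = [1] * 2000 + [0] * 8000

fractions = label_distribution(labels)
weights = balanced_class_weights(labels)

MIN_POSITIVE_FRACTION = 0.10  # the working limit suggested in the reply
meets_limit = fractions[1] >= MIN_POSITIVE_FRACTION

print(f"positive fraction: {fractions[1]:.2f}, meets 10% limit: {meets_limit}")
print(f"class weights: {weights}")
```

With 2,000 positives out of 10,000 images, the positive fraction is 20%, which is above the 10% limit. If weighting were used, the cat class would be weighted four times as heavily as the no-cat class under this heuristic.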
