Training set label distribution

auch_hunter · January 2, 2022, 2:24pm

I did not find (or notice) information about training set distribution. In particular, I am interested in how many examples of each label should be in the training set. For example, let's take the cat detection problem: how many images with label = 1 and how many with label = 0 should be in my training set? My guess is approximately half and half. But is it a big problem if the distribution is skewed, for example if I do not have enough non-cat (label = 0) examples? Thanks in advance!

SomeshChatterjee · January 3, 2022, 3:17am

Hi, welcome to discourse!

Yes, you are correct: ideally the number of samples with label 0 and label 1 is similar. However, in real life it is very common to get a dataset with a large number of samples labeled 0 and very few labeled 1 (or vice versa), and that is indeed a problem. Scenarios like this are common in use cases such as fraud detection, where the number of genuine transactions is far greater than the number of fraudulent ones.
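To make the problem concrete, here is a minimal sketch with made-up numbers (990 genuine and 10 fraudulent transactions are an assumption for illustration, not real data). A "model" that always predicts the majority class looks very accurate while never catching a single fraud case:

```python
# Hypothetical fraud-style dataset: 990 genuine (label 0), 10 fraud (label 1).
labels = [0] * 990 + [1] * 10

# A "model" that ignores its input and always predicts the majority class.
predictions = [0] * len(labels)

# Accuracy: fraction of predictions that match the true label.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on the rare class: fraction of true fraud cases that were detected.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
actual_positives = sum(y == 1 for y in labels)
recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2%}")  # accuracy = 99.00%
print(f"recall   = {recall:.2%}")    # recall   = 0.00%
```

This is why, with skewed labels, it usually helps to look at per-class metrics such as recall and precision rather than accuracy alone.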

There are multiple ways to handle this scenario. For example, you can undersample the class (label) that has a large amount of data, or you can generate synthetic data for the under-represented class using techniques like SMOTE or other data-augmentation techniques.
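As a rough illustration of the undersampling option, here is a small sketch using only the Python standard library. The `undersample` helper, the file names, and the 900/100 split are all hypothetical and chosen just for the example; it randomly drops examples until every class is as large as the rarest one:

```python
import random
from collections import Counter

def undersample(samples, labels, seed=0):
    """Randomly drop examples so every class has as many as the rarest class."""
    rng = random.Random(seed)

    # Group the samples by their label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    # The size of the smallest class is the target size for every class.
    target = min(len(xs) for xs in by_class.values())

    # Draw `target` examples without replacement from each class.
    balanced = []
    for y, xs in by_class.items():
        for x in rng.sample(xs, target):
            balanced.append((x, y))

    # Shuffle so the classes are not grouped together.
    rng.shuffle(balanced)
    new_samples = [x for x, _ in balanced]
    new_labels = [y for _, y in balanced]
    return new_samples, new_labels

# Hypothetical toy data: 900 "cat" images (label 1), 100 "not cat" (label 0).
samples = [f"img_{i}.jpg" for i in range(1000)]
labels = [1] * 900 + [0] * 100

balanced_samples, balanced_labels = undersample(samples, labels)
print(Counter(labels))           # class counts before: 900 vs 100
print(Counter(balanced_labels))  # class counts after: 100 vs 100
```

The trade-off is that undersampling throws away data from the larger class. SMOTE goes the other way and creates new minority-class points by interpolating between existing ones; if I remember correctly, the imbalanced-learn package provides an implementation of it.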

I have given a simplified view of what can be done, hope it helps.

Thanks.

ai_curious · January 3, 2022, 10:38pm

A commonly used term that will help you find information on this topic is "class imbalance". A search on those words turns up lots of resources.
