Training set label distribution

I could not find any information about training set label distribution. In particular, I am interested in how many examples of each label should be in the training set. For example, take the cat detection problem: how many images with label = 1 and how many with label = 0 should be in my training set? My guess is approximately half and half. But is it a big problem if the distribution is skewed, for example if I do not have enough non-cat (label = 0) examples? Thanks in advance!

Hi, welcome to Discourse!

Yes, you are correct: it is ideal if the numbers of samples with label 0 and label 1 are similar. However, in real life it is very common to get a dataset in which a large number of examples are labeled '0' and very few are labeled '1' (or vice versa), and that is indeed a problem. Scenarios like this are common in use cases such as fraud detection, where genuine transactions far outnumber fraudulent ones.
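To see why a skewed distribution is a problem, here is a minimal sketch with made-up numbers (990 genuine and 10 fraudulent transactions are an assumption for illustration, not real data). A "model" that always predicts the majority class looks accurate while never catching a single fraud case:

```python
from collections import Counter

# Hypothetical labels: 990 genuine transactions (0) and 10 fraudulent ones (1).
labels = [0] * 990 + [1] * 10

counts = Counter(labels)
majority_label = counts.most_common(1)[0][0]

# A "model" that ignores its input and always predicts the majority class.
predictions = [majority_label] * len(labels)

# Fraction of all examples predicted correctly.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Fraction of the actual fraud cases that were predicted as fraud.
fraud_caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = fraud_caught / counts[1]

print(f"class counts: {dict(counts)}")
print(f"accuracy: {accuracy:.2%}, fraud recall: {recall:.2%}")
```

So with imbalanced data, accuracy alone can be misleading, and it is worth checking the class counts before training.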

There are multiple ways to handle this scenario. You can undersample the class (label) that has a large amount of data, or you can generate synthetic data for the under-represented class using SMOTE or other data augmentation techniques.
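As a rough sketch of the resampling idea, using only the Python standard library: `undersample` randomly drops examples from the larger class, and `oversample` randomly duplicates examples from the smaller class. Note that plain random oversampling is a simpler stand-in here, not SMOTE (SMOTE creates new synthetic points instead of copies). The function names and the toy file names are hypothetical.

```python
import random
from collections import Counter


def _group_by_label(examples, labels):
    """Collect examples into a dict keyed by their label."""
    by_label = {}
    for x, y in zip(examples, labels):
        by_label.setdefault(y, []).append(x)
    return by_label


def undersample(examples, labels, seed=0):
    """Randomly drop examples so every class is as small as the rarest one."""
    rng = random.Random(seed)
    by_label = _group_by_label(examples, labels)
    target = min(len(xs) for xs in by_label.values())
    balanced = [(x, y) for y, xs in by_label.items()
                for x in rng.sample(xs, target)]
    rng.shuffle(balanced)
    return balanced


def oversample(examples, labels, seed=0):
    """Randomly duplicate examples so every class is as large as the biggest one."""
    rng = random.Random(seed)
    by_label = _group_by_label(examples, labels)
    target = max(len(xs) for xs in by_label.values())
    balanced = []
    for y, xs in by_label.items():
        # Keep all originals, then add random copies to reach the target size.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        balanced.extend((x, y) for x in xs + extra)
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 90 cat images (label 1) and 10 non-cat images (label 0).
examples = [f"img_{i}.jpg" for i in range(100)]
labels = [1] * 90 + [0] * 10

under = undersample(examples, labels)
over = oversample(examples, labels)
print(Counter(y for _, y in under))
print(Counter(y for _, y in over))
```

Undersampling throws data away, and oversampling by duplication can encourage overfitting to the repeated examples, so which one fits depends on how much data you have. Resampling should be applied to the training set only, not to the validation or test sets.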

I have given a simplified view of what can be done. Hope it helps.



A commonly used term that will help you find information on this topic is class imbalance. A search for those words turns up lots of resources.