Along with ‘how many hidden layers?’ and ‘how many filters in a layer?’, this has to be one of the most basic questions people ask about CNNs. The generic answer is ‘more data is better’, but it is hard to find anything that quantifies what the minimum is. Today I ran some experiments, which I share below.
tl;dr: the order-of-magnitude minimum for training images is 10^4
** update 27 FEB 2023: I did more experiments using the Veggies and posted the results here: How much data does a CNN need to learn - continuation **
I live on a rural property in the Eastern US, which I monitor with cameras. Each week I manually review hundreds to a thousand still pictures. Most are benign or rather uninteresting: shadows or tree branches moving in the wind, groups of the over-abundant young White Tail deer, etc. Out of the many thousands of pictures in all, I have a few hundred of rabbits, a few tens of foxes and other small mammals like skunks, and a handful or fewer of coyotes, bears messing with my honey bees, or the neighbor’s troublemaking kid up to no good. I’ve been thinking about using the tools from these classes to automate that review so I only have to look at the most interesting pictures.
I know I have a class imbalance problem, and I wondered how many images I would have to label to get anything decent and whether data augmentation could help with either issue. For guidance, I went to the TensorFlow Image Classification tutorial here:
Image classification | TensorFlow Core, which contains links to the data augmentation tutorial.
I used the model shown on that page, and much of the other code such as the dataset building, data augmentation, and visualization. I used the flowers dataset found there, but also one roughly 7x smaller and one roughly 4x bigger that I found on Kaggle. Details below.
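For reference, the dataset building followed the tutorial’s `tf.keras.utils.image_dataset_from_directory` approach. Here is a minimal sketch, assuming a 10% validation split (which matches the train/validation counts in the summaries below) and the tutorial’s 180x180 input size; the folder path is a placeholder for wherever a given dataset is unpacked:

```python
import pathlib
import tensorflow as tf

data_dir = pathlib.Path("datasets/chessman")  # placeholder path; swap in flowers or veggies
img_size = (180, 180)                         # the tutorial's input size
batch_size = 32

# One sub-folder per class; these calls print the "Found N files ... Using M files ..."
# messages quoted in the dataset summaries below.
train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir, validation_split=0.1, subset="training",
    seed=123, image_size=img_size, batch_size=batch_size)
val_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir, validation_split=0.1, subset="validation",
    seed=123, image_size=img_size, batch_size=batch_size)

class_names = train_ds.class_names            # e.g. ['Bishop', 'King', ..., 'Rook']
```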
The model is pretty straightforward and varies only at the output layer, because the three datasets have different numbers of classes: same input size, same Conv2D layers with the same filters, and so on (see the sketch after the dataset summaries). Here are the three dataset summaries:
Chess Pieces
data found at Chessman image dataset | Kaggle
Found 552 files belonging to 6 classes.
Using 497 files for training.
Using 55 files for validation.
[‘Bishop’, ‘King’, ‘Knight’, ‘Pawn’, ‘Queen’, ‘Rook’]
Flowers
from the augmentation tutorial
Found 3670 files belonging to 5 classes.
Using 3303 files for training.
Using 367 files for validation.
[‘daisy’, ‘dandelion’, ‘roses’, ‘sunflowers’, ‘tulips’]
Veggies
data found on Kaggle
Found 15000 files belonging to 15 classes.
Using 13500 files for training.
Using 1500 files for validation.
[‘Bean’, ‘Bitter_Gourd’, ‘Bottle_Gourd’, ‘Brinjal’, ‘Broccoli’, ‘Cabbage’, ‘Capsicum’, ‘Carrot’, ‘Cauliflower’, ‘Cucumber’, ‘Papaya’, ‘Potato’, ‘Pumpkin’, ‘Radish’, ‘Tomato’]
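To make the “varies only at the output layer” point concrete, here is roughly the tutorial model I reused, continuing from the loading sketch above (`img_size`, `class_names`). The augmentation block is the optional piece toggled on and off; my exact filter counts may differ a little, so treat this as representative rather than exact:

```python
num_classes = len(class_names)   # 6, 5, or 15 depending on the dataset

# Optional augmentation block from the tutorial; dropped for the "no augmentation" runs.
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal", input_shape=img_size + (3,)),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])

model = tf.keras.Sequential([
    data_augmentation,
    tf.keras.layers.Rescaling(1. / 255),
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(num_classes),  # the only layer that changes across the three datasets
])
```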
I compiled the models with the same parameters, the same loss, and the same optimizer, trained each for the same number of epochs, and ran every dataset with augmentation turned on and off. A sketch of that setup:
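(This assumes the tutorial’s choices of the Adam optimizer and sparse categorical cross-entropy on logits; the epoch count below is just a stand-in for whatever fixed number every run used.)

```python
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])

# Same call for every dataset / augmentation combination.
history = model.fit(train_ds, validation_data=val_ds, epochs=15)

# history.history holds 'loss', 'accuracy', 'val_loss', and 'val_accuracy',
# which is what the observations below are based on.
```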
Here is what I found…
Chess Pieces - with only a few hundred training images the model quickly overfit and performed poorly on the validation set. Validation loss was concave up: it dropped briefly, then climbed for the rest of training, the classic overfitting signature (see the plotting sketch after these results). Augmentation didn’t help.
Flowers - also overfit on the roughly 7x larger dataset, though here augmentation did help.
Veggies - this one performed well out of the box, and didn’t need augmentation.
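The loss-curve observations above come from plotting the fit history; a minimal sketch, assuming matplotlib and the `history` object from the training call:

```python
import matplotlib.pyplot as plt

# The overfitting signature: training loss keeps falling while validation loss
# bottoms out and climbs back up (the "concave up" shape mentioned above).
epochs_range = range(len(history.history["loss"]))
plt.plot(epochs_range, history.history["loss"], label="training loss")
plt.plot(epochs_range, history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```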
My high-level takeaway is that my admittedly rather simple model didn’t do well with either ~500 or ~3,300 training images, while it did much better with ~13,500. If the available training data is on the order of 10^2 or 10^3 images, expect poor results, or at a minimum expect extra work such as augmentation, class balancing, or model optimization (architecture and other hyperparameter trials). Somewhere toward 10^4 the amount of data starts to be sufficient. More is likely still better, but somewhere around 10K images might be good enough. A caveat is that the three datasets are different, so ‘it depends’ still applies; I’ll do some additional experiments with the Veggies to quantify more precisely what that curve looks like (a rough sketch of the follow-up is below).
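A rough sketch of that follow-up experiment: rebuild the same model and train it on progressively larger slices of the Veggies training set, recording the best validation accuracy at each size. This continues from the sketches above (`train_ds`, `val_ds`, `batch_size`), and `build_model` is a hypothetical helper that re-creates the tutorial model with 15 output classes:

```python
import tensorflow as tf

# Hypothetical follow-up: trace validation accuracy vs. training-set size on Veggies.
sizes = [500, 1000, 2000, 5000, 10000, 13500]   # images used for training in each run
results = {}

for n in sizes:
    # image_dataset_from_directory shuffles file order by default, so the first n
    # images are a roughly class-balanced slice; cache() pins the same slice each epoch.
    subset = train_ds.unbatch().take(n).cache().batch(batch_size)

    model = build_model(num_classes=15)          # hypothetical helper: rebuilds the model above
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"])
    history = model.fit(subset, validation_data=val_ds, epochs=15, verbose=0)
    results[n] = max(history.history["val_accuracy"])

print(results)   # the shape of this curve is the real question
```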
Thoughts?