How much data does a CNN need to learn?

Along with ‘how many hidden layers?’ or ‘how many filters in a layer?’ this has got to be one of the basic questions people ask about CNNs. The generic answer is ‘more data is better’ but it is hard to find anything that quantifies what the minimum is. Today I did some experiments, which I share below.

tl;dr: the order-of-magnitude minimum number of training images is 10^4

** update 27 FEB 2023: I did more experiments using the Veggies and posted the results here: “How much data does a CNN need to learn - continuation” **

I live on a rural property in the Eastern US, which I monitor with cameras. Each week, I manually review hundreds to a thousand still pictures. Most are benign or rather uninteresting: shadows or tree branches moving due to wind, groups of the over-abundant young White Tail deer, etc. Out of the many thousands of pictures in all, I have a few hundred of rabbits, a few tens of foxes and other small mammals like skunks, and a handful or fewer of coyotes, bears messing with my honey bees, or the neighbor’s troublemaking kid up to no good. I’ve been thinking about using the tools from these classes to automate that review process so I only have to look at the most interesting ones.

I know I have a class imbalance problem. I wondered how many images I would have to label to get anything decent, and whether or not data augmentation could help me with both issues. For guidance, I went to the TensorFlow Image Classification tutorial here:
Image classification | TensorFlow Core, which contains links to the data augmentation tutorial.
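
A common complement to augmentation for class imbalance is to weight each class inversely to its frequency. A minimal sketch in plain Python, assuming made-up counts that only loosely mirror the proportions described above (the class names and numbers are hypothetical, not measured):

```python
def balanced_class_weights(counts):
    """Weight each class by N / (K * n_c), so rare classes count more in the loss."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}

# Hypothetical label counts, for illustration only.
counts = {"deer": 5000, "rabbit": 300, "fox": 30, "coyote": 5}
weights = balanced_class_weights(counts)
for name, w in weights.items():
    print(f"{name:>7}: {w:.2f}")
```

Keras `Model.fit` accepts a dictionary of this kind through its `class_weight` argument, keyed by integer class index rather than by name. With weights this extreme, training can become unstable, so treat the formula as a starting point.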

I used the model shown on that page, and much of the other code such as the dataset building, data augmentation, and visualization. I used the flowers dataset found there, but also one 10x smaller and one 5x bigger that I found on Kaggle. Details below.

The model is pretty straightforward, and varies only at the output layer due to the difference in the number of classes in the three datasets. Same input size, same Conv2D layers with same filters blah blah blah. Here are the three dataset model summaries:

Chess Pieces
data found at Chessman image dataset | Kaggle

Found 552 files belonging to 6 classes.
Using 497 files for training.
Using 55 files for validation.
[‘Bishop’, ‘King’, ‘Knight’, ‘Pawn’, ‘Queen’, ‘Rook’]

Flowers
data from the augmentation tutorial

Found 3670 files belonging to 5 classes.
Using 3303 files for training.
Using 367 files for validation.
[‘daisy’, ‘dandelion’, ‘roses’, ‘sunflowers’, ‘tulips’]


Veggies
data found on Kaggle

Found 15000 files belonging to 15 classes.
Using 13500 files for training.
Using 1500 files for validation.

[‘Bean’, ‘Bitter_Gourd’, ‘Bottle_Gourd’, ‘Brinjal’, ‘Broccoli’, ‘Cabbage’, ‘Capsicum’, ‘Carrot’, ‘Cauliflower’, ‘Cucumber’, ‘Papaya’, ‘Potato’, ‘Pumpkin’, ‘Radish’, ‘Tomato’]

I compiled the models with the same parameters, same loss and optimizer. Trained for the same number of epochs. Turned augmentation on and off. Here is what I found…

Chess Pieces - with only a few hundred training images the model quickly overfit and performed poorly against the validation set. Validation loss was concave up. Augmentation didn’t help.

Flowers - also overfit on the roughly 10x larger dataset, though here augmentation did help.

Veggies - this one performed well out of the box, and didn’t need augmentation.

My high level takeaway is that my admittedly rather simple model didn’t do well with either ~500 or ~3,300 training images, while it did much better with ~13,500. If available training data is of the order of 10^2 or 10^3, expect poor results, or at a minimum expect to go through some extra work such as augmentation and balancing or model optimization (architecture and other hyperparameter trials). Somewhere towards 10^4 the amount of data starts to be sufficient. Likely more is still better, but somewhere around 10K records might be good enough. One caveat is that all three datasets are different, so the answer is still ‘it depends’. I’ll do some additional experiments with the Veggies and see if I can quantify more precisely what the curve looks like.
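
One way to trace such a curve on a single dataset is to train the same model on stratified fractions of the training set and record validation accuracy at each size. A minimal sketch of the subsampling step only, assuming a hypothetical list of (path, label) pairs and an evenly balanced stand-in for the Veggies training split:

```python
import random
from collections import defaultdict

def stratified_subset(samples, fraction, seed=0):
    """Return a random subset keeping roughly `fraction` of each class.

    `samples` is a list of (path, label) pairs; at least one item per class is kept.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append(path)
    subset = []
    for label, paths in sorted(by_label.items()):
        keep = max(1, round(len(paths) * fraction))
        subset.extend((p, label) for p in rng.sample(paths, keep))
    return subset

# Hypothetical stand-in for the Veggies training set: 15 classes x 900 images = 13,500.
samples = [(f"class_{c}/img_{i}.jpg", c) for c in range(15) for i in range(900)]

for fraction in (0.01, 0.03, 0.1, 0.3, 1.0):
    subset = stratified_subset(samples, fraction)
    # Train the same model on `subset` here and record validation accuracy.
    print(f"{fraction:>5.0%} -> {len(subset)} images")
```

Spacing the fractions roughly evenly on a log scale matches the order-of-magnitude framing above. Keeping the validation set fixed across runs makes the points comparable.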



Thanks for your thoughts and valuable input!

Did you think about quantifying the uncertainty of your model and exploiting this information?

Active learning can help to quantify:

  • which label is expected to provide a valuable benefit and also
  • when a sufficient amount of data has been used to train your model.

Active learning is used when obtaining labels is expensive, in either cost or time. Here is a simplified visualization of what it does in a regression task, using model uncertainty to decide which label to measure next:

Specifically for your problem, I believe the approach could be suitable for choosing which labels to measure, especially in the important areas close to the decision boundary, to get a good model even with limited (but the right) data:


If you are interested in sources and applications, feel free to check out:

Best regards


ViewAL is a multi-cup-of-coffee paper for me :thinking:

This related one looks closer to what I imagine my task is, which is classifying an entire image with a single label that is something like deer, small-mammal, predator, criminal (that’s for the neighbor’s kid).

Seems like a bit of chicken and egg in that I need a trained model to tell me which images provide useful information about label uncertainty. But if I had a trained model, I wouldn’t be uncertain in the first place! Maybe I can hand label a couple hundred, start training a model, and use uncertainty on a second batch to identify the best candidates to hand label, bootstrapping up to a larger dataset that generalizes better in the end. Food for thought, thanks @Christian_Simonis
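
The selection step of that bootstrapping loop can be sketched with entropy-based uncertainty sampling, which is one common active learning heuristic and not necessarily the one used in the papers linked above. The file names and probabilities below are hypothetical stand-ins for softmax outputs from a model trained on the first hand-labeled batch:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def most_uncertain(predictions, k):
    """Return the k image names whose predicted distributions have the highest entropy."""
    ranked = sorted(predictions, key=lambda name: entropy(predictions[name]), reverse=True)
    return ranked[:k]

# Hypothetical softmax outputs over four classes.
predictions = {
    "img_001.jpg": [0.97, 0.01, 0.01, 0.01],  # confident
    "img_002.jpg": [0.40, 0.35, 0.15, 0.10],  # unsure
    "img_003.jpg": [0.25, 0.25, 0.25, 0.25],  # maximally unsure
    "img_004.jpg": [0.70, 0.20, 0.05, 0.05],
}
to_label_next = most_uncertain(predictions, k=2)
print(to_label_next)
```

Each round would hand label the selected images, add them to the training set, retrain, and repeat until validation performance stops improving.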

@ai_curious , thanks so much for sharing your project!

I know you have a lot of experience with YOLO. Isn’t YOLO an option for your objective?




Hi @Juan_Olano, Thanks for the feedback. I really only want to automate the coarse grained review…silently filter out all the common images, raise attention to the rare ones. Those I want to look at myself. So classification at the image level, not the object level, is fine. And no localization needed. Except for the neighbor’s kid…then I want the perimeter defense armament to automatically engage. Kidding, just kidding. Mostly.

Btw here are some of the critters I see in my yard…

Coyote (?)

Red fox

Black bear smiling after wiping out my bee hives


WOW that’s very cool! And your farm seems to be a very attractive spot for these critters!

Thank you for explaining why you are not using YOLO for this use case.

This is a nice post and topic you presented @ai_curious. My general thoughts: an AI model will only learn what you teach it. I believe that if the model sees a certain situation once (unless there is too much conflict with other situations), it will remember it in its weights (this of course depends on the model architecture too).

Data augmentation may increase the dataset size but the overall features of the images are almost the same.

I would be curious about what happens if we blend many objects into one image and keep training the model to detect many classes at the same time (an image rich enough to train the model well for many scenarios).

Also, what if you train the model only for the classes that you have many images for, and then check only those images that get no positive classification, i.e. where no known object has been detected, so you only have to look at those?
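
That triage idea can be sketched as a simple confidence threshold: anything the model does not confidently assign to a known class goes into a manual review queue. The threshold value, class set, and probabilities here are hypothetical, and softmax confidence is only a rough proxy for "unknown object", since models can be confidently wrong on things they have never seen:

```python
def needs_review(probs, threshold=0.8):
    """True when no known class is predicted with at least `threshold` probability."""
    return max(probs) < threshold

# Hypothetical softmax outputs over the well-represented classes (e.g. deer, rabbit, empty).
batch = {
    "img_101.jpg": [0.95, 0.03, 0.02],
    "img_102.jpg": [0.45, 0.30, 0.25],
    "img_103.jpg": [0.10, 0.05, 0.85],
}
review_queue = [name for name, probs in batch.items() if needs_review(probs)]
print(review_queue)
```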


Maybe it really is a binary problem after all; 1) merits attention 0) nothing interesting to see here.

Night versus day may also be a challenge. Virtually all the false positives are in the daylight. At night it is very rare for the sensors to trigger unless there is an animal in range. Perhaps I should switch to taking black and white pictures in the daytime so the palettes are closer.
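
If the camera cannot be switched, the same effect can be approximated in preprocessing by converting the daytime color images to grayscale. A minimal NumPy sketch, assuming the common BT.601 luminance weights and a tiny made-up image:

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an (H, W, 3) RGB array to (H, W) luminance using BT.601 weights."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb.astype(np.float64) @ weights

# Hypothetical 2x2 RGB image standing in for a daytime photo.
rgb = np.array([[[255, 255, 255], [255, 0, 0]],
                [[0, 255, 0], [0, 0, 0]]], dtype=np.uint8)
gray = to_grayscale(rgb)
print(gray.shape)
```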


I think if you have enough images for nights and days and their numbers are balanced, it should be ok. And yeah, it’s probably coming down to this: “Maybe it really is a binary problem after all; 1) merits attention 0) nothing interesting to see here.”

I thought of three other reasons to not use YOLO. First is that I don’t need anything approaching near real time processing. I download the images from the memory cards days to weeks after they are taken, and have all the time in the world to run the processing. Second, because YOLO has 20 gazillion parameters to learn, it takes a LOT of data, with augmentation that ensures all the grid cell + anchor boxes ‘detectors’ get positive and negative training, and a LOT of horsepower to do all those computations. I am trying to find a way to use as little data as I can get away with and run it on my wee Apple Macbook. Finally, I almost never see, and don’t need to classify, more than one type of object per image. That is, if the bear is in the frame, the deer and the neighbor’s kid are not. Since there is no ROI for me other than bragging rights that I got it to work, I will almost certainly skip YOLO and use a simpler CNN.

My thoughts:

  1. Regarding real time vs delayed analysis of the collected data, you can use YOLO under this circumstance too. YOLO can be run on your data at any time.

  2. Regarding the gazillion parameters, I would suggest using one of the small models, maybe yolov8n. Very small footprint.

  3. Regarding the need of a single class per frame, also no problem. Yolo works.

  4. Regarding the amount of data needed to fine-tune the model, this will hold true for any solution you seek. All of them will probably need the same amount of training data.

Well, I read this and it feels as if I were representing Ultralytics, but no, I have no relationship with them :slight_smile:

Hello @ai_curious,

You have many neighbours! I think these are the “interesting” photos you want to check on yourself.

So the problem is too many false positives?

Usually I prefer to make specific suggestions after looking at the data, because then the chance that they make sense will be higher. However, I think I can rely on you to filter out the senseless parts and perhaps turn the rest into feasible ideas :wink:

I assume you have still cameras, and that in the view of each camera the overall background is more or less the same at the same hour of day, except maybe for some weather conditions. So I was wondering whether you could get 100 “uninteresting” photos during the day, and another set of 100 during the night. Then you take an average of the day-set and another average of the night-set. When preparing your dataset for training or inference, you subtract the day-average from every day-image, and the night-average from every night-image. In this way, you can get rid of a lot of useless background information at training/inference.

I talked about only a day-set and a night-set, but we can have 4 sets per day or 3 sets per day. We may need to redefine the hours for each set depending on the seasons.
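
The averaging and subtraction described above can be sketched in NumPy as follows. The frames are synthetic and the function names are hypothetical. Real photos would first need to be loaded as arrays of identical size from the same camera position:

```python
import numpy as np

def mean_background(images):
    """Pixel-wise average of a stack of same-sized 'uninteresting' frames."""
    return np.mean(np.stack(images).astype(np.float64), axis=0)

def subtract_background(image, background):
    """Signed difference image; values can be negative where the scene got darker."""
    return image.astype(np.float64) - background

# Hypothetical data: 100 noisy 4x4 grayscale frames of an empty scene.
rng = np.random.default_rng(0)
day_frames = [100 + rng.normal(0, 2, size=(4, 4)) for _ in range(100)]
day_average = mean_background(day_frames)

# A new frame with a bright "animal" in one corner.
new_frame = np.full((4, 4), 100.0)
new_frame[0, 0] = 180.0
diff = subtract_background(new_frame, day_average)
print(np.round(diff, 1))
```

The same two functions would be applied separately to each time-of-day set, with one stored average per set.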

What do you think?


Just wondering how many “interesting”/“uninteresting” photos are there in the day and in the night. I sometimes find the data to be more interesting when looking for improvements :wink:

This is an interesting idea Raymond! I am only curious about one thing. Please help me think it through:

Let’s say we have a day pic of a rabbit and we subtract the avg of day-pics. What becomes of the rabbit?

Should I then prepare the rabbit samples with a subtraction of the average and train with this? I am thinking this would solve it. Or am I overthinking it?

Hello Juan,

The rabbit will look uglier :stuck_out_tongue_winking_eye:, but if it was sufficient for the model, then it would be acceptable.

This is one way of augmentation, but let’s first see what @ai_curious thinks about this idea or, if it is viable, some subtracted images. I wish I could just try it out myself immediately :laughing: :laughing:


Btw, after subtraction, some pixel values will become negative. When drawing those subtracted images, it’s important not to clip the negatives off, but to keep them by shifting every pixel so that the minimum value maps to zero (or by taking the absolute value).

For humans, it is useful to keep those negative pixels when viewing the image.
For an ML model, it might not be needed.
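
Both display options can be sketched in NumPy. The rescaling to 0-255 is an added assumption for viewing purposes, and the input array is a made-up difference image:

```python
import numpy as np

def shift_to_displayable(diff):
    """Shift and scale a signed difference image into the 0-255 range."""
    shifted = diff - diff.min()          # most negative pixel becomes 0
    peak = shifted.max()
    if peak == 0:                        # flat image: nothing to scale
        return np.zeros_like(shifted, dtype=np.uint8)
    return np.round(shifted / peak * 255).astype(np.uint8)

def magnitude(diff):
    """Alternative: keep only the size of the change, discarding its sign."""
    return np.clip(np.abs(diff), 0, 255).astype(np.uint8)

diff = np.array([[-40.0, 0.0], [10.0, 60.0]])  # hypothetical subtracted image
print(shift_to_displayable(diff))
print(magnitude(diff))
```

Shifting preserves whether a pixel got darker or brighter, while the absolute value treats both as "changed".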

The cameras can shoot video or still but I use still only due to storage limitations. They remain at fixed position for long periods of time…months. I move them around the property occasionally to keep the neighbor’s kid guessing but could stop doing that if it helped the ML project. I consider the problem to be that the cameras have a lower/different threshold for positive than I do. The sun or moon moving behind a cloud can be enough to trigger a photo. Moderate snowfall or heavy rain. TBH sometimes I have no idea why a picture was taken. The cameras don’t offer fine grained sensitivity control. Not exactly on/off but close. To date I haven’t done any EDA on the pixel values. Maybe there is a simple statistical measure that correlates with what I consider ‘interesting’. In a way it is kind of ‘something is in the foreground of this one that isn’t usually there’. So I need a metric for usually. Thanks all for the thoughts and suggestions.
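
One candidate metric for "usually" is a per-pixel mean and standard deviation over empty frames, with each new frame scored by the fraction of pixels that deviate strongly. A minimal NumPy sketch on synthetic data, where the threshold and function names are hypothetical:

```python
import numpy as np

def fit_usual(frames):
    """Per-pixel mean and standard deviation over frames of the empty scene."""
    stack = np.stack(frames).astype(np.float64)
    return stack.mean(axis=0), stack.std(axis=0)

def unusual_fraction(frame, mean, std, z_threshold=4.0, eps=1e-6):
    """Fraction of pixels deviating from the usual value by more than z_threshold std devs."""
    z = np.abs(frame.astype(np.float64) - mean) / (std + eps)
    return float(np.mean(z > z_threshold))

# Hypothetical data: 200 noisy 8x8 grayscale frames with nothing in them.
rng = np.random.default_rng(1)
empty_frames = [120 + rng.normal(0, 3, size=(8, 8)) for _ in range(200)]
mean, std = fit_usual(empty_frames)

quiet = 120 + rng.normal(0, 3, size=(8, 8))
busy = quiet.copy()
busy[2:6, 2:6] += 60  # a 4x4 "animal" in the foreground

print(unusual_fraction(quiet, mean, std), unusual_fraction(busy, mean, std))
```

A global brightness change, such as the sun going behind a cloud, would push most pixels over the threshold at once, so a score near 1.0 may indicate lighting rather than an animal. Normalizing each frame by its own mean brightness first is one possible mitigation.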

You are welcome, @ai_curious!

Although I don’t have a concrete answer to the question in the title of this thread, I hope that if the subtraction trick works, we might need less data. This is how my idea relates to the thread.

I think it’s fine to move it around, because it shouldn’t take long to re-establish a new set of averages to subtract with.

:smile: That was another side I was trying to see. But who knows? Even if a simple statistical measure can do the job of distinguishing interesting photos from uninteresting ones, maybe in the future there will be some reasons for you to reconsider a CNN for more dedicated classification. At that time, the subtraction trick can still help.