I’m working on a problem that can best be described as a variation on image detection. I’m using a ConvNet with 2 convolutional layers, 2 pooling layers, 2 fully-connected layers, and a softmax layer. The total number of parameters in my model is approximately 150,000. My input images have shape (40, 40, 1), i.e. 1,600-pixel grayscale images. Can anyone provide some guidance on how many training samples I would need to get decent predictive performance? I’m not looking for an exact number, just an order-of-magnitude estimate. Thanks in advance.
I don’t think there is a universally valid answer to a question like this. It all depends on how complex the images are and how much variability there is in the features you are trying to detect. E.g. if they were all images of single printed characters in one given font, all oriented right side up and nicely centered in the “frame”, then you could probably get away with O(10^3) samples, or worst case O(10^4). But it all depends. One concrete example I know of is the famous MNIST handwritten digit dataset. Those are 28 x 28 greyscale images, and the standard training set is 60k images (with a separate 10k test set). By subdividing the data appropriately into train/dev/test sets you can get quite good performance, and that is at least intuitively a harder problem than the hypothetical recognition of printed characters in a fixed font that I gave above.
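In case it helps, here is a minimal sketch of the train/dev/test subdivision mentioned above, in plain Python. The function name `split_indices`, the 80/10/10 fractions, and the fixed seed are my own assumptions for illustration, not a recommendation tuned to your problem; the idea is just to shuffle the sample indices once and carve out disjoint subsets.

```python
import random


def split_indices(n_samples, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into train/dev/test lists.

    The fractions and seed are illustrative defaults (assumptions), not
    values taken from any particular dataset or course.
    """
    if dev_frac < 0 or test_frac < 0 or dev_frac + test_frac >= 1:
        raise ValueError("dev_frac and test_frac must be >= 0 and sum to < 1")

    indices = list(range(n_samples))
    # A dedicated Random instance keeps the shuffle reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(indices)

    n_dev = round(n_samples * dev_frac)
    n_test = round(n_samples * test_frac)

    dev = indices[:n_dev]
    test = indices[n_dev:n_dev + n_test]
    train = indices[n_dev + n_test:]  # everything left over is training data
    return train, dev, test


# Example with an MNIST-sized set of 60,000 samples.
train_idx, dev_idx, test_idx = split_indices(60_000)
print(len(train_idx), len(dev_idx), len(test_idx))
```

You would then use these index lists to select rows from your image and label arrays. If dev-set performance is still improving noticeably as you feed in more of the training indices, that is a hint that more data would help.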