I recently did a few experiments to try to answer this question for myself and wrote about it here:
My admittedly unscientific approach suggests that somewhere on the order of 10^4 images is where there is enough data to dominate hyperparameter and architecture choices and not require augmentation. If you have only tens, hundreds, or even thousands of training images, I think you are going to have to do some extra work to get generalizable results. I am hoping others will weigh in, since this is something I am actively working on.
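For what it's worth, the main "extra work" I have in mind is data augmentation: generating multiple training views from each image so the model sees more variation than the raw dataset contains. Here is a minimal sketch of that idea using only random horizontal flips and random crops (NumPy only; the image size, crop size, and number of views are arbitrary choices for illustration, not a recommendation):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, crop_size):
    """Return one randomly flipped and cropped view of an H x W x C image."""
    h, w = image.shape[:2]
    # Random horizontal flip with probability 0.5
    if rng.random() < 0.5:
        image = image[:, ::-1]
    # Random crop to crop_size x crop_size
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    return image[top:top + crop_size, left:left + crop_size]

# Example: expand one 32x32 RGB "image" into 8 augmented 28x28 views
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
views = [augment(image, 28) for _ in range(8)]
print(len(views), views[0].shape)
```

In practice you would use a library pipeline (e.g. torchvision transforms or Keras preprocessing layers) and apply the augmentation on the fly each epoch rather than precomputing views, but the principle is the same.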