Doubt about the non-graded portion of the Week 2 ResNets programming assignment

When I tried to make predictions by feeding the model my own hand images for some signed numbers, the model was unable to recognise the numbers correctly (it classified 3 as 1 and 4 as 2). There is also a question at the end of the notebook:
"Even though the model has high accuracy, it might be performing poorly on your own set of images. Notice that, the shape of the pictures, the lighting where the photos were taken, and all of the preprocessing steps can have an impact on the performance of the model. Considering everything you have learned in this specialization so far, what do you think might be the cause here?

Hint: It might be related to some distributions. Can you come up with a potential solution ?"
Can someone please explain why the model is not doing well on new images?

The fundamental issue with all the assignments in these courses is that the compute and storage resources available to the course notebooks are severely limited, so the training sets are all unrealistically small. To get a model that generalizes well, your training data needs to reflect the statistical distribution of the “real” inputs the model has to handle.
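Part of that distribution is the preprocessing itself. In the ResNets notebook the SIGNS images are, if I remember right, 64×64×3 arrays scaled into [0, 1] by dividing by 255, so any photo you test with has to go through exactly the same steps. Here is a minimal sketch, assuming a trained Keras model named `model` and those conventions (the file name is just a placeholder):

```python
import numpy as np
from PIL import Image

def preprocess_own_image(path, target_size=(64, 64)):
    """Apply the same preprocessing the training set received
    (resize to the training resolution, scale to [0, 1])."""
    img = Image.open(path).convert("RGB")          # drop any alpha channel
    img = img.resize(target_size)                  # match training resolution
    x = np.asarray(img, dtype=np.float32) / 255.0  # same scaling as training
    return np.expand_dims(x, axis=0)               # add the batch dimension

# `model` is assumed to be the trained ResNet50 from the notebook
x = preprocess_own_image("my_hand_sign.jpg")       # hypothetical file name
pred = model.predict(x)
print("predicted class:", np.argmax(pred, axis=1))
```

Even with identical preprocessing, though, the deeper problem below usually remains.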

As one example of this phenomenon, remember back in Course 1 Week 4, where we trained the “cat recognition” model with literally 209 training samples and 50 test samples. By “real world” standards, that is a laughably small dataset. Here’s a thread that runs some experiments perturbing that dataset a little and shows it is actually very carefully curated to work as well as it does.
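To make the notebook’s “distributions” hint concrete, you can compare crude statistics of the training images with your own preprocessed photo. This is only a diagnostic sketch, assuming `X_train` is the (m, 64, 64, 3) training array from the notebook already scaled to [0, 1] and `x` is the preprocessed image from the sketch above:

```python
import numpy as np

def channel_stats(images):
    """Per-channel mean and std over all pixels: a crude
    summary of where the image distribution sits."""
    flat = images.reshape(-1, images.shape[-1])
    return flat.mean(axis=0), flat.std(axis=0)

train_mean, train_std = channel_stats(X_train)  # training distribution
own_mean, own_std = channel_stats(x)            # your preprocessed photo

print("train mean/std per RGB channel:", train_mean, train_std)
print("yours mean/std per RGB channel:", own_mean, own_std)
# Large gaps here (different lighting, background, framing, skin tone)
# suggest your photo is effectively drawn from a different distribution
# than the one the model was trained on.
```

If the statistics differ a lot, the usual remedies are to collect or augment training data that looks like your real inputs, or to stage your own photos to resemble the training set (similar background, lighting, and framing).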