The models we train here (even the more advanced ones in Week 4 of this course, using the same dataset) just do not “generalize” well. The dataset is extremely small for a problem this difficult. It’s actually kind of amazing that the results are as good as they are. It turns out that they are “cooked” to be this good. Here’s a thread which discusses that in more detail w.r.t. the full 4 layer implementation of a Neural Net that we use for this same “cat detection” problem in Week 4.