No cat at all, but predicting a cat


My L-layer model reached a test set accuracy of 80%, as expected.

I find it interesting that this completely cat-free image is classified as “cat”:

I therefore played around with the number and size of the layers and, for example, increased them to [12288, 64, 64, 32, 1], just thinking that “more must be better”.

Indeed, the training set accuracy went up to an incredible 99.9999999%, but the test set accuracy dropped to little more than 70%, indicating that the model is now overfitting. However, my test image above was now correctly classified as “non-cat”.
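One way to see why the larger network overfits so easily is to count its trainable parameters. The sketch below is only an illustration: the helper name `count_parameters` is made up, and the baseline architecture [12288, 20, 7, 5, 1] and the 209 training images are assumptions about the assignment's defaults, not something measured here.

```python
def count_parameters(layer_dims):
    """Count weights and biases of a fully connected network.

    layer_dims[0] is the input size; each later entry is a layer's unit count.
    Layer l contributes n[l] * n[l-1] weights plus n[l] biases.
    """
    return sum(n_prev * n + n for n_prev, n in zip(layer_dims, layer_dims[1:]))


# Assumption: the assignment's default 4-layer architecture.
baseline_dims = [12288, 20, 7, 5, 1]
# The enlarged architecture from this post.
larger_dims = [12288, 64, 64, 32, 1]
# Assumption: number of training images in the assignment's data set.
m_train = 209

for name, dims in [("baseline", baseline_dims), ("larger", larger_dims)]:
    n_params = count_parameters(dims)
    print(f"{name}: {n_params} parameters, "
          f"about {n_params / m_train:.0f} per training example")
```

Under these assumptions, both networks have far more parameters than training examples, and the enlarged one has roughly three times as many as the baseline, so it has even more freedom to memorize the training set.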

Just wanted to share this experience. Choosing the number and size of layers is obviously not easy. The fit to the training data can reach incredibly high accuracy, but the danger of overfitting is then quite high.

The number of training samples seems much too low to reach higher generalization.

Any other experiences like this? Did anybody find a better configuration?

Best regards


What you are sharing is really interesting. With such a small amount of data, increasing the number or the size of the layers makes the model learn a more complex function. As a result, the model overfits.

It’s always an interesting experience to take a case like this and try to improve or modify it. Thanks for doing this and sharing your results. I think your overall conclusion here is exactly the key point:

Here’s another thread with a different type of experiment on this test case, which reaches the same conclusion.


I now randomly generated 100 DNN architectures with the following constraints:

  • number of layers per network from 4 to 8
  • number of units per layer from 2 to 80
  • number of units decreasing or staying equal from left to right
  • at most 2 consecutive layers of the same size
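A minimal sketch of how such architectures could be generated is shown below. It is not the code used for the experiment: the function name `random_layer_dims` is made up, and it assumes the course convention that the layer count includes the output layer but not the input layer, that the limits of 2 to 80 units apply to hidden layers only, and that the input has 12288 features with a single output unit.

```python
import random


def random_layer_dims(rng, n_x=12288, min_layers=4, max_layers=8,
                      min_units=2, max_units=80):
    """Return one random layer_dims list satisfying the constraints above."""
    # Assumption: "layers" counts the hidden layers plus the output layer.
    n_hidden = rng.randint(min_layers, max_layers) - 1
    while True:
        # Sorting in descending order makes the sizes non-increasing.
        hidden = sorted(
            (rng.randint(min_units, max_units) for _ in range(n_hidden)),
            reverse=True,
        )
        # Reject draws where the same size appears three times in a row.
        has_triple = any(
            hidden[i] == hidden[i + 1] == hidden[i + 2]
            for i in range(len(hidden) - 2)
        )
        if not has_triple:
            return [n_x] + hidden + [1]


rng = random.Random(0)  # fixed seed so the draw is reproducible
architectures = [random_layer_dims(rng) for _ in range(100)]
print(architectures[:3])
```

Sorting random draws is just one simple way to satisfy the "decreasing or equal" rule; other sampling schemes would give a different distribution over architectures.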

Gradient descent with 2500 iterations, learning_rate as in the assignment.

I then trained these 100 DNNs with the given train/test split of the course 1/week 4 programming assignment.

Here are the winners of this little “competition” (accuracy in the “data” column):

As a pattern, it can be seen that the “best” networks all started with around 30 units and mostly had a very small last hidden layer (only 2 units in most cases).

I don’t know if there is a more systematic way than generating such architectures randomly. However, I think much more than 84% test accuracy won’t be possible with these data and this implementation.

Best regards

Hi, Matthias.

This is really interesting! Thanks very much for your continuing investigations here and for sharing your results. In terms of how to do this kind of thing more systematically, I’ve never really tried this type of experiment, but the idea of randomly exploring that big a search space is probably a reasonable way to do it. Prof Ng will talk about how to choose multiple hyperparameters simultaneously in Course 2 of this series. If I’m remembering the details correctly, he recommends starting with a grid approach, but not to actually evaluate at every point in the grid. Instead he recommends randomly sampling points in the grid to try, which does sound pretty similar to how you approached the problem here.
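As a rough sketch of that idea, one can lay out a grid of candidate settings and then evaluate only a random subset of its points. Everything in the example is hypothetical: the hyperparameter names and the candidate values are placeholders chosen for illustration, not recommendations from the course.

```python
import itertools
import random

# Hypothetical candidate values for three hyperparameters.
learning_rates = [0.001, 0.003, 0.0075, 0.01, 0.03]
first_hidden_sizes = [10, 20, 30, 40, 60, 80]
layer_counts = [4, 5, 6, 7, 8]

# The full grid has 5 * 6 * 5 = 150 combinations.
grid = list(itertools.product(learning_rates, first_hidden_sizes, layer_counts))

# Instead of training all 150 models, draw 20 distinct grid points at random.
rng = random.Random(0)
trials = rng.sample(grid, k=20)

for learning_rate, first_hidden_size, n_layers in trials:
    # Each tuple would be handed to the training routine of your choice.
    print(learning_rate, first_hidden_size, n_layers)
```

The point of sampling is that the cost is fixed by the number of trials you choose rather than by the size of the grid, which grows quickly as more hyperparameters are added.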

One other general point is that we are using a pretty simple approach to Gradient Descent here, with a fixed learning rate and a fixed number of iterations. My first thought was that it seems like a risky bet to assume that a 6-layer network with (say) 250 total neurons will be able to achieve the same level of convergence as a 4-layer network with 64 total neurons in the same number of iterations with the same learning rate. But interestingly your results show 4-, 5- and 6-layer nets as the top 3 finishers, so apparently the fixed LR and iteration count do not really interfere that much. You could “hold that thought” and wait until we get to Adam optimizers and TensorFlow in Course 2. Then it would be easier to construct this kind of experiment with more sophisticated forms of GD and see if that really does make any difference.

Thanks again!