No cat at all, but predicting a cat

Hi,

My L-layer model reached a test set accuracy of 80%, as predicted.

I find it interesting that this completely cat-free image is classified as “cat”:

I therefore played around with the number and size of the layers and, for example, increased the dimensions to [12288, 64, 64, 32, 1], just thinking that “more must be better”.
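For reference, here is roughly what that configuration looks like in code. This is only a sketch: `layers_dims` is the actual list I used, while `L_layer_model` is the helper from the assignment notebook, so its exact signature may differ in your version.

```python
# The enlarged configuration: 64*64*3 = 12288 inputs, three hidden layers, one sigmoid output
layers_dims = [12288, 64, 64, 32, 1]

# Rough count of trainable parameters, to see how much capacity this adds (~793k)
n_params = sum(layers_dims[l] * layers_dims[l + 1] + layers_dims[l + 1]
               for l in range(len(layers_dims) - 1))
print("trainable parameters:", n_params)

# Training then uses the assignment's helper (name assumed from my notebook):
# parameters = L_layer_model(train_x, train_y, layers_dims, num_iterations=2500, print_cost=True)
```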

Indeed, the training set accuracy went up to an incredible 99.9999999%, but the test set accuracy dropped to little more than 70%, indicating that the model is now overfitting. However, my test image above was now correctly classified as “non-cat”.

Just wanted to share this experience. Choosing the number and size of the layers is obviously not easy: the fit to the training data can reach incredibly high accuracy, but the danger of overfitting is then quite high.

The number of training samples seems much too low to reach better generalization.

Any other experiences like this? Did anybody find a better configuration?

Best regards
Matthias


What you are sharing is really interesting. With such a small amount of data, increasing the number of layers or the size of the layers makes the model learn a more complex function. As a result, the model overfits.

It’s always an interesting experience to take a case like this and try to improve or modify it. Thanks for doing this and sharing your results. I think your overall conclusion here is exactly the key point.

Here’s another thread with a different type of experiment on this test case, which reaches the same conclusion.


I now randomly generated 100 DNN architectures with the following constraints:

  • number of layers per network from 4 to 8
  • number of units per layer from 2 to 80
  • number of units from left to right decreasing or equal
  • maximum 2 consecutive layers of same size

Gradient descent with 2500 iterations, learning_rate as in the assignment.
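In case it helps, here is a rough sketch of how such architectures could be sampled under these constraints. The `sample_architecture` helper is hypothetical, not part of the assignment, and my actual sampling code may have differed slightly:

```python
import random

def sample_architecture(n_x=12288, n_y=1, max_tries=1000):
    """Sample one layers_dims list satisfying the constraints above (hypothetical helper)."""
    for _ in range(max_tries):
        n_hidden = random.randint(4, 8)                      # 4 to 8 (hidden) layers per network
        # 2 to 80 units per layer, non-increasing from left to right
        sizes = sorted((random.randint(2, 80) for _ in range(n_hidden)), reverse=True)
        # reject samples with more than 2 consecutive layers of the same size
        ok = all(not (sizes[i] == sizes[i + 1] == sizes[i + 2])
                 for i in range(n_hidden - 2))
        if ok:
            return [n_x] + sizes + [n_y]
    raise RuntimeError("could not sample a valid architecture")

architectures = [sample_architecture() for _ in range(100)]
# Each entry is a layers_dims list that could then be passed to the assignment's
# L_layer_model with num_iterations=2500 and the default learning rate.
```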

I then trained these 100 DNNs with the given train/test split of the Course 1, Week 4 programming assignment.

Here are the winners of this little “competition” (accuracy in the “data” column):

One pattern that emerges is that the “best” networks all started with around 30 units in the first hidden layer and mostly had a very small last hidden layer (only 2 units in most cases).

I don’t know if there is a more systematic way than generating such architectures randomly. However, I think much more than 84% test accuracy won’t be possible with this data and this implementation.

Best regards
Matthias

Hi, Matthias.

This is really interesting! Thanks very much for your continuing investigations here and for sharing your results. In terms of how to do this kind of thing more systematically, I’ve never really tried this type of experiment, but randomly exploring that big a search space is probably a reasonable way to do it. Prof Ng will talk about how to choose multiple hyperparameters simultaneously in Course 2 of this series. If I’m remembering the details correctly, he recommends starting with a grid, but not actually evaluating every point in it. Instead he recommends randomly sampling points in the grid to try, which sounds pretty similar to how you approached the problem here.
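To make that idea concrete, here is a small sketch of “randomly sampled points in a grid”. The hyperparameter values and the commented-out training call are only illustrative, not taken from the assignment:

```python
import itertools
import random

# Define a coarse grid of candidate hyperparameter values (illustrative values only)
learning_rates = [0.0075, 0.01, 0.03]
hidden_widths  = [16, 32, 64]
depths         = [3, 4, 5]

# Evaluate only a random sample of grid points instead of the full Cartesian product
grid = list(itertools.product(learning_rates, hidden_widths, depths))
candidates = random.sample(grid, k=10)   # try 10 of the 27 combinations

for lr, width, depth in candidates:
    layers_dims = [12288] + [width] * depth + [1]
    # train and score each candidate, e.g. with the assignment's helper:
    # parameters = L_layer_model(train_x, train_y, layers_dims,
    #                            learning_rate=lr, num_iterations=2500)
```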

One other general point is that we are using a pretty simple approach to Gradient Descent here, with a fixed learning rate and a fixed number of iterations. My first thought was that it seems like a risky bet to assume that a 6 layer network with (say) 250 total neurons will reach an equivalent level of convergence as a 4 layer network with 64 total neurons in the same number of iterations at the same learning rate. But interestingly your results show 4, 5 and 6 layer nets as the top 3 finishers, so apparently the fixed learning rate and iteration count do not really interfere that much. You could “hold that thought” until we get to the Adam optimizer and TensorFlow in Course 2. Then it would be easier to construct this kind of experiment with more sophisticated forms of Gradient Descent and see whether that really does make any difference.

Thanks again!

Cheers,
Paul