Hi,
In the last part of the assignment, we see different training accuracies when we try different numbers of hidden nodes, and the notebook describes this as a sign of overfitting as the number of hidden layer nodes increases.
I understand that overfitting can happen if there are too many hidden layer nodes. However, why would we observe overfitting without any test data? Normally, overfitting means we get nearly perfect predictions on the training data but worse predictions on the test data. As the number of hidden layer nodes increases, why would the training accuracy itself get worse?
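Just to make sure we are talking about the same thing, the check I have in mind looks something like this (a hypothetical sketch, not from the notebook; nn_model, predict, X, and Y are assumed to be the notebook's helpers and training data, with examples as columns):

```python
import numpy as np

# Hypothetical overfitting check: hold out part of the data,
# train on the rest, and compare the two accuracies.
m = X.shape[1]
split = int(0.8 * m)
X_train, X_test = X[:, :split], X[:, split:]
Y_train, Y_test = Y[:, :split], Y[:, split:]

parameters = nn_model(X_train, Y_train, n_h=50, num_iterations=10000)
train_acc = float(np.mean(predict(parameters, X_train) == Y_train)) * 100
test_acc = float(np.mean(predict(parameters, X_test) == Y_test)) * 100

# Overfitting would show up as train_acc much higher than test_acc.
print(f"train: {train_acc:.2f} %, test: {test_acc:.2f} %")
```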
Hope someone can correct my logic!
Many thanks.
This assignment is kind of a special case in that there is no “test” data, as you observe. So the definition of “overfitting” does not even apply: we want as perfect a fit as we can get. My guess is that the problem with very large numbers of hidden layer nodes is that the training becomes very expensive and perhaps the convergence is also more difficult, meaning that you need a more sophisticated strategy with dynamically managed learning rates or the like. It’s not a given that convergence will work the same with different numbers of neurons: you may have to tweak other hyperparameters like number of iterations and learning rate.
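As a concrete example of what I mean by a dynamically managed learning rate, here is a minimal sketch of inverse-time decay; the function name and decay constant are purely illustrative, not anything from the notebook:

```python
def decayed_learning_rate(initial_rate, iteration, decay=0.001):
    # Inverse-time decay: start with a large step size for fast early
    # progress, then shrink it so the later updates can settle.
    return initial_rate / (1.0 + decay * iteration)

# e.g. recompute the rate inside the gradient-descent loop:
# learning_rate = decayed_learning_rate(1.2, i)
```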
I just checked the results that I got in that section of the notebook and here they are:
Accuracy for 1 hidden units: 67.5 %
Accuracy for 2 hidden units: 67.25 %
Accuracy for 3 hidden units: 90.75 %
Accuracy for 4 hidden units: 90.5 %
Accuracy for 5 hidden units: 91.25 %
Accuracy for 20 hidden units: 90.75 %
Accuracy for 50 hidden units: 90.25 %
I would say that the difference between 90.25% and 91.25% accuracy is basically in the noise, but you could try some experiments running more iterations in the n = 20 and n = 50 cases and see if the accuracy improves.
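Something along these lines would do it (a sketch, assuming the notebook's nn_model and predict helpers, its training data X and Y, and that nn_model accepts a num_iterations argument):

```python
import numpy as np

# Sketch: rerun the larger networks with progressively more
# iterations and watch whether the training accuracy improves.
for n_h in (20, 50):
    for num_iterations in (10000, 20000, 40000):
        parameters = nn_model(X, Y, n_h, num_iterations=num_iterations)
        predictions = predict(parameters, X)
        accuracy = float(np.mean(predictions == Y)) * 100
        print(f"n_h = {n_h}, iterations = {num_iterations}: {accuracy:.2f} %")
```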
Thanks Paul! Your explanation makes sense to me.
But it would be better if the Week 3 assignment could be revised in this section, because the wording about “overfitting” is kind of confusing (even though this part is optional).
That’s a good point. I had forgotten the comments that they make about overfitting. I think they are speaking “in general”, meaning that in the normal case, where you are training a model that is intended to apply to multiple different inputs, a network with a large number of nodes will tend to overfit unless you apply regularization.
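For reference, L2 regularization just adds a weight penalty to the cost; here is a hypothetical sketch, not something this assignment implements, assuming the parameters are stored under keys like "W1" and "W2":

```python
import numpy as np

# Hypothetical L2-regularized cost: the penalty grows with the
# squared weights, discouraging a large network from fitting the
# training noise too closely.
def compute_cost_with_l2(cross_entropy_cost, parameters, lambd, m):
    l2_penalty = (lambd / (2 * m)) * (
        np.sum(np.square(parameters["W1"])) +
        np.sum(np.square(parameters["W2"]))
    )
    return cross_entropy_cost + l2_penalty
```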
But the specific case here is not learning a “general” model, so that concept doesn’t really apply. I’ll file an issue with the course staff and hope that they can come up with some better wording in that section.
Thank you for pointing that out!