I’m working on the C2W4 assignment in the TensorFlow Developer course. I’m able to complete the assignment and get decent results without issues. However, there is one small thing that I cannot get my head around.
In the Deep Learning Specialization, Dr. Andrew explains the concepts of overfitting/underfitting early in the course, and he covers the point in good detail. The bottom line is that overfitting is when there is low bias (on the training dataset, the average prediction is very close to the truth) and high variance (on the validation dataset, the average prediction is far from the truth). The solution is to use a smaller network that fits just right, or to increase the data, and all the techniques revolve around these two ideas: L2 regularization, dropout, augmentation, etc.
So my conclusion from what I understood is that the smaller the network, the better, as long as it fits my training data. Then I can apply regularization techniques to address the variance on the validation set.
What I noticed in the C2W4 assignment is that you can fit the training data with a very small network: 2 or 4 convolution filters in each of two layers, and a Dense layer as small as around 32 neurons. The training set is fit completely while validation accuracy sits around 80%. This is a classic example of overfitting, so based on my conclusion I should now only need regularization and other overfitting-mitigation techniques.
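For concreteness, here is a minimal sketch of the kind of small network described above. The input shape, optimizer, and binary output are illustrative assumptions, not the assignment's exact settings:

```python
# Hypothetical sketch of the small model described above; input shape,
# output head, and optimizer are assumptions, not the assignment's values.
import tensorflow as tf

small_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(4, (3, 3), activation="relu",
                           input_shape=(150, 150, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(4, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32, activation="relu"),      # small dense layer
    tf.keras.layers.Dense(1, activation="sigmoid"),    # binary classification head
])
small_model.compile(optimizer="adam",
                    loss="binary_crossentropy",
                    metrics=["accuracy"])
```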
What puzzles me is that increasing the network size (more filters, more neurons in the dense layer) also reduces the overfitting. This contradicts what I concluded from Dr. Andrew’s explanation. Does this mean my conclusion is wrong, or is there something else I’m missing? If a small network overfits and performs badly on the validation data, how can a bigger network still overfit but perform better on the validation data?
Hi, @mousa_alsulaimi !
As you may have noticed, the underfitting/overfitting concept is a general intuition that comes in handy when you first train a deep learning model. Nevertheless, things get considerably more complicated once you start tweaking and fine-tuning the network.
While the bias-variance concept is correct, in practice you will face other obstacles, such as local minima, hyperparameter settings, data distribution shifts, vanishing and exploding gradients, etc.
In this particular example, it seems the smaller model does not really learn the nature of the training data very well, since the bigger one generalizes better. I would like to know a little more about it:
- How well do both models perform in terms of loss and metrics on train and validation?
- Is the difference between the smaller and larger models really significant?
- Are all the hyperparameters the same?
Adding more layers can help improve performance and potentially reduce overfitting if you:
- can learn more hierarchical structures that help to solve your problem
- succeed in learning more abstract and complex behaviour (the classic image-processing example: in the first layers the filters learn edges, in the next layers more complex shapes, and in the final layers these shapes are combined to describe and ultimately predict your class)
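To illustrate the capacity difference being discussed, here is a sketch of a larger variant of the same two-conv architecture. The filter and unit counts are illustrative assumptions for comparison, not values from the assignment:

```python
# Illustrative larger variant of the same architecture; filter/unit counts
# and input shape are assumptions for comparison, not assignment values.
import tensorflow as tf

larger_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu",
                           input_shape=(150, 150, 3)),   # more low-level filters
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),  # richer mid-level shapes
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```

With more filters per layer, each stage can represent more edge and shape detectors, which is the hierarchical-feature argument in the list above.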
Assuming your two models are really comparable with respect to data and chosen hyperparameters, I would have expected your training performance to also get better in absolute terms. Could you provide the loss curves on the train/dev sets for both variants?
In general, I agree with you: increasing model complexity with more parameters can increase the risk of overfitting. In the end, it’s a trade-off: finding the sweet spot between allowing the model to learn more complex and abstract patterns and not having too many parameters for the available data.
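One common way to navigate that trade-off, in line with the techniques mentioned earlier in the thread (L2 regularization and dropout), is to keep a larger model's capacity but constrain it. This is only a hedged sketch; the dropout rate, L2 factor, and shapes are illustrative:

```python
# Hedged sketch: keeping a larger model's capacity while limiting overfitting.
# The dropout rate, L2 factor, and input shape are illustrative assumptions.
import tensorflow as tf

regularized_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(
        32, (3, 3), activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(1e-4),  # penalize large weights
        input_shape=(150, 150, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),  # randomly drop activations during training
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```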