[Course 2] Regularization effect with Smaller NN

Why does using a smaller NN seem to have a regularization effect?


Using a smaller (simpler) NN architecture and using regularization are two different ways to approach the problem of overfitting (high variance). I am not a theoretician or an expert in any of this, so it’s also possible that the two methods are theoretically equivalent. But it is common in mathematics that things can be theoretically equivalent, but not equivalent in practical terms.

Let’s suppose that you have an overfitting problem and want to approach it by using a simpler network. The issue you immediately face is that there are lots of “degrees of freedom” in how you could do that: you could decrease the number of neurons in some or all of the hidden layers, or you could try fewer hidden layers, perhaps with a few more neurons in some of them, such that the aggregate complexity is lower. For every choice like that, you then have to retrain your network and compare the results. So it is a complicated search space, and it takes some thought to explore it in an organized and efficient way.

As an alternative, consider how you could approach the same problem with L2 regularization: you only have one “knob” to turn, which is the \lambda value. That’s a one-dimensional space to search, so you could just pick a few \lambda values across a range from small to large and then fine-tune from there.
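To make the comparison above concrete, here is a rough sketch that just counts candidates in each search space. The depth and width options and the range of \lambda values are numbers I made up for illustration; they are not from the course.

```python
import itertools

# Hypothetical choices for shrinking the architecture (illustrative only).
depth_options = [1, 2, 3]        # number of hidden layers
width_options = [4, 8, 16, 32]   # neurons allowed in each hidden layer

# Every combination of depth and per-layer widths is a separate network
# that would have to be retrained and compared.
architectures = [
    widths
    for depth in depth_options
    for widths in itertools.product(width_options, repeat=depth)
]

# With L2 regularization the architecture stays fixed and only lambda varies.
# Assumption: candidates are spaced by powers of ten from 0.001 to 10.
lambdas = [10.0 ** exponent for exponent in range(-3, 2)]

print(len(architectures))  # 4 + 16 + 64 = 84 candidate networks
print(len(lambdas))        # 5 candidate lambda values
```

Even with only three depths and four widths, the architecture space has 84 points and no natural ordering, while the \lambda candidates sit on a single axis from weak to strong regularization.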

In other words, you could achieve the same goal by using a smaller NN architecture, but it might be easier to just start with a network that is a bit too complex (overkill) and then dial in just the right amount of regularization.
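A minimal sketch of what “dialing in” \lambda could look like: a coarse sweep followed by a finer one around the best coarse value. The `dev_error` function is a toy stand-in that I invented so the example runs on its own; in practice each call would mean retraining the network with that \lambda and measuring the error on the dev set. The helper name `coarse_to_fine` and its default ranges are also just assumptions.

```python
import math

def dev_error(lam):
    """Toy stand-in for 'train with this lambda, then measure dev-set error'.

    Hypothetical: this curve is simply U-shaped in log10(lambda), with its
    minimum placed at lambda = 0.3. A real run would retrain the network.
    """
    return (math.log10(lam) - math.log10(0.3)) ** 2

def coarse_to_fine(score, low_exp=-4, high_exp=2, fine_points=9):
    # Coarse pass: one candidate per power of ten.
    coarse = [10.0 ** e for e in range(low_exp, high_exp + 1)]
    best = min(coarse, key=score)
    # Fine pass: log-spaced candidates within one decade either side of best.
    center = math.log10(best)
    step = 2 / (fine_points - 1)
    fine = [10.0 ** (center - 1 + i * step) for i in range(fine_points)]
    return min(fine, key=score)

best_lambda = coarse_to_fine(dev_error)
print(best_lambda)
```

With these defaults the whole search costs 7 + 9 = 16 evaluations, and every one of them reuses the same (deliberately oversized) architecture.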

Please note that I am just a fellow student and do not have any practical experience applying these techniques. So the above is just an idea that I’m suggesting, not something I’ve heard Prof Ng specifically state.

If I’m missing your point, please let me know.