Why Regularization Reduces Overfitting Lecture


If the purpose of regularization is to cancel out the effect of some neurons, doesn't that mean in practice that we shouldn't think too much about our NN size? Can we just use regularization later to get the right size?

I think there is the kernel of a good intuition there! Maybe going into a little more detail would flesh it out:

The point is that when you are tuning your hyperparameters, you have a lot of degrees of freedom. That means the search space is huge, and an exhaustive exploration is intimidating and potentially pretty expensive. Just think at the level of the number of layers, the number of neurons per layer, and the activation functions, and that's already a huge search space.

So rather than doing that exhaustive search, you can afford a bit of "overkill" in the complexity of your network and then "dial in" some regularization to damp down the overfitting. Tuning L2 regularization, for example, is way less complicated: you have only one hyperparameter to tune, the λ value, so it's much less time-consuming to find a good one. A similar argument can be made for dropout, where you just need to tune the "keep probability" value.
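To make those two knobs concrete, here is a minimal NumPy sketch of what each one controls. The function names, shapes, and values (λ = 0.7, keep probability = 0.8) are made up for illustration, not taken from the course code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weight matrices for a tiny 2-layer network.
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((1, 4))

def l2_penalty(weights, lambd, m):
    """L2 term added to the cost: (lambda / (2m)) * sum of squared
    Frobenius norms over all weight matrices. Larger lambda pushes
    the weights toward zero, shrinking the effective network."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

def inverted_dropout(A, keep_prob):
    """Inverted dropout: zero each activation with probability
    (1 - keep_prob), then rescale by keep_prob so the expected
    value of the activations is unchanged."""
    mask = rng.random(A.shape) < keep_prob
    return (A * mask) / keep_prob

# One hyperparameter each: lambda for L2, keep_prob for dropout.
penalty = l2_penalty([W1, W2], lambd=0.7, m=100)
A = rng.standard_normal((4, 5))       # some layer's activations
A_drop = inverted_dropout(A, keep_prob=0.8)
```

Either way, you are trading the large architecture search (layers × neurons × activations) for a one-dimensional sweep over a single scalar.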

Of course the size of your network has pretty direct implications for the training cost, so you don’t want to go too far overboard in terms of the total number of layers and neurons.
