Week 1: dropout vs reducing network?

It is an interesting point. At a high level, what you are saying must be correct. I do not really know the definitive answer, but if people as smart as Geoff Hinton, Yann LeCun and Andrew Ng think that dropout is a useful concept, then there must be more subtleties here. Maybe you would find some comment on this point in the original paper from Prof Hinton’s group that introduced dropout.

Here is my guess about regularization in general: it’s just easier to tune than the architecture of your network. If you think about it, there are lots of ways you can change the architecture: you can add or subtract layers, and you can add or subtract neurons from any of the various layers. That’s a lot of degrees of freedom and hence a large search space to explore. Maybe it’s simpler just to make your network a bit bigger than you really need (“overkill”) and then “dial in” a bit of regularization to damp down the overfitting (high variance) that may result. E.g. in the case of L2 regularization, you only have to do a simple one-dimensional search over the single hyperparameter λ. In the case of dropout, it’s a little more complex in that you have to choose to which layers to apply dropout, and you could even use different keep_prob values at different layers. But the point is that you still have fewer “knobs to turn”, and maybe that saves work overall. That’s just my guess, not based on any actual knowledge :nerd_face:.
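For concreteness, here is a minimal sketch of the “inverted dropout” idea with a per-layer keep_prob. This is my own toy NumPy code (the function name and shapes are made up for illustration, not taken from the course assignments):

```python
import numpy as np

def dropout_forward(a, keep_prob, rng):
    """Inverted dropout (illustrative sketch, not the course's code).

    Each unit is kept with probability keep_prob; the survivors are
    scaled by 1/keep_prob so the expected activation is unchanged,
    which is why nothing special is needed at test time."""
    mask = rng.random(a.shape) < keep_prob   # keep each unit with prob keep_prob
    return a * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((1000, 100))                     # toy activations, all 1.0
out = dropout_forward(a, keep_prob=0.8, rng=rng)

# Roughly keep_prob of the entries survive, and the mean stays near 1.0
# thanks to the 1/keep_prob rescaling.
print(round((out > 0).mean(), 2), round(out.mean(), 2))
```

The one “knob” here is keep_prob, which plays the same role for dropout that λ plays for L2: a single scalar you can dial to trade variance against bias on a given layer.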

Or to state my conjecture in the same terms in which you asked the question: maybe tuning the size of the network is not easier. It may be conceptually simpler, but in practice it’s not actually easier to execute. Regularization is actually the easier path to achieve the desired balance between bias and variance.

Actually this seems like an instance in which the famous A. Einstein quote applies: “In theory, theory and practice are the same. In practice, they’re not.”
