Dropout vs. L2 regularization

Ng teaches dropout and L2 regularization as techniques to address overfitting (high variance).

Both, in my opinion, effectively limit the model's ability to fit high-order approximations, which, as we know from traditional numerical methods, can produce oscillatory, spurious results without filtering or other dissipative methods.

My question is: do we have any guidelines or short recommendations for picking which one is effective? Or do we have to try both and decide which works better for each case?

I favor dropout since the keep_prob rate is between 0 and 1, so it is easy to play with and find the optimal value. However, with L2 regularization, the lambda value is a little ambiguous for me to experiment with (should I start with 0.0001 or 0.1? Again, I have no idea where to start and end). As of today, after this lesson, I cannot tell which range I should search in for the optimal lambda. keep_prob, on the other hand, is definitely easy, since we know its range is (0, 1).

Of course, it might be that these are simply two effective tools and it is not necessarily about picking one over the other; in some cases, we might want to deploy both.

Can we say that L2 regularization has its own place, and similarly dropout?

Another reason I like dropout as a regularization method is that it can surgically control the smoothing between layers -- for example, you might want to apply dropout in layers with many neurons but limit dropout in layers with fewer neurons.
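Just to make concrete what I mean, here is a rough sketch in the inverted-dropout style from the programming exercise, with a different keep_prob per layer (the layer sizes and keep_prob values are made up purely for illustration):

```python
import numpy as np

# Sketch only: toy layer sizes and per-layer keep_prob values chosen for illustration.
np.random.seed(1)
layer_dims = [20, 100, 50, 1]        # input, two hidden layers, output
keep_probs = {1: 0.7, 2: 0.9}        # heavier dropout on the wide hidden layer 1

# Toy parameter initialization
params = {}
for l in range(1, len(layer_dims)):
    params["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
    params["b" + str(l)] = np.zeros((layer_dims[l], 1))

def forward_with_dropout(X, params, keep_probs):
    A = X
    L = len(layer_dims) - 1
    for l in range(1, L):                            # hidden layers only
        Z = params["W" + str(l)] @ A + params["b" + str(l)]
        A = np.maximum(0, Z)                         # ReLU
        kp = keep_probs.get(l, 1.0)                  # layers not listed get no dropout
        D = np.random.rand(*A.shape) < kp            # dropout mask
        A = A * D / kp                               # inverted dropout: rescale to keep the expected value
    ZL = params["W" + str(L)] @ A + params["b" + str(L)]
    return 1 / (1 + np.exp(-ZL))                     # sigmoid output

X = np.random.randn(20, 5)                           # 5 toy examples
print(forward_with_dropout(X, params, keep_probs).shape)   # (1, 5)
```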

Yes, as Tom says, this is just experimental. I should make the disclaimer that I have never had a job doing ML/DL or any kind of AI: all I know is what I’ve heard Professor Ng say in these courses. But one thing that is true of Professor Ng is that if there are “rules of thumb” in any given case that he’s teaching us, he will tell us about them and how to apply them. So I think it’s a reasonable deduction to conclude here that there are none. Of course you can also try some googling and see if you find more help out there.

You make an interesting point that dropout is more configurable than L2, in that you could choose only some of the layers to apply dropout and select different “keep prob” values per layer. Note that when we graduate to using TensorFlow (coming soon in Week 3), even there we still have the ability to configure which layers have dropout and select the “keep” values in each instance. In TF, dropout is structured as a “layer function”, so you choose which layers will have it.
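For instance, here is a minimal Keras sketch of that (the layer sizes and rates are arbitrary; note that tf.keras.layers.Dropout takes the fraction to drop, so rate = 1 - keep_prob):

```python
import tensorflow as tf

# Sketch only: arbitrary layer sizes and dropout rates.
# Keras Dropout takes the fraction to *drop*, so rate = 1 - keep_prob.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dropout(0.3),                     # heavier dropout after the wide layer
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dropout(0.1),                     # lighter dropout here
    tf.keras.layers.Dense(1, activation="sigmoid"),   # no dropout on the output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```

One nice property of doing it this way is that the Dropout layers are only active during training; at inference time Keras turns them off automatically.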

If you were writing the L2 code yourself, you could actually do a similar sort of customization: you could choose to omit some layers from the “sum of the squares” of the weights. Or you could separate the sums for various layers and use a different \lambda value per layer.
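As a rough sketch of what I mean (the per-layer lambda values here are arbitrary), the cost term could be computed like this, and in back prop you would then add (lambda_l / m) * W_l to the corresponding dW:

```python
import numpy as np

def l2_cost_per_layer(params, lambdas, m):
    """Per-layer L2 penalty: sum over layers l of lambda_l / (2m) * ||W_l||_F^2.
    A layer with lambda_l = 0 is effectively omitted from the penalty."""
    cost = 0.0
    for l, lambd in lambdas.items():
        W = params["W" + str(l)]
        cost += lambd / (2 * m) * np.sum(np.square(W))
    return cost

# Toy example: arbitrary weights and per-layer lambda values
np.random.seed(0)
params = {"W1": np.random.randn(100, 20),
          "W2": np.random.randn(50, 100),
          "W3": np.random.randn(1, 50)}
lambdas = {1: 0.1, 2: 0.01, 3: 0.0}   # layer 3 omitted from the penalty
print(l2_cost_per_layer(params, lambdas, m=64))
```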

Of course more configurability is the classic double-edged sword: now your hyperparameter search space is suddenly quite a bit bigger. As you say, maybe dropout is the best place to start in most overfitting cases. See whether you can meet your accuracy requirements that way first; only if that fails, consider the other alternatives.
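On your earlier question of where to start with lambda: there is no single magic value, but because the useful values can differ by orders of magnitude, a common approach is to try candidates on a logarithmic scale and compare them on the dev set. A tiny sketch of sampling candidates that way (the range 1e-4 to 1 is just an assumption for illustration):

```python
import numpy as np

# Sketch: sample candidate lambda values on a log scale between 1e-4 and 1.
# Each candidate would then be used to train the model and compared on dev set performance.
np.random.seed(0)
exponents = np.random.uniform(-4, 0, size=5)
candidate_lambdas = 10 ** exponents
print(np.sort(candidate_lambdas))
```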

Thank you, Paul,

I forgot that L2 has a summation over the layers (l), so we also have control over which layers it applies to.

Thanks for your detailed discussion,