Week 1 - why regularization works with ReLU


In the first week, Prof. Ng explains why regularization works by showing that if the cost function “forces” the weights to be small, that also gives small z (z = W*a), which in turn keeps g(z) in the roughly linear region around zero when g(z) = tanh(z). I.e. we get a more linear model, and I guess the analogy holds for the sigmoid as well. So far so good.
What I do not understand is why regularization makes a difference if I use the ReLU. I mean, the ReLU is either linear or 0 regardless of whether z is “big” or “small”, positive or negative.
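To make my point concrete, here is a quick NumPy sketch of the premise (my own illustration, not from the lecture): near zero, tanh is approximately the identity, while ReLU is positively homogeneous, so scaling z down by shrinking the weights just rescales the ReLU output without changing its shape.

```python
import numpy as np

# tanh is approximately linear near zero, so small weights -> small z
# -> a nearly linear activation.
z_small = np.array([-0.1, -0.05, 0.05, 0.1])
print(np.allclose(np.tanh(z_small), z_small, atol=1e-3))  # True: tanh(z) ~ z near 0

# ReLU, by contrast, commutes with positive scaling: shrinking the
# weights by a factor c just rescales the output by c.
relu = lambda z: np.maximum(z, 0.0)
z = np.array([-2.0, -0.5, 0.5, 2.0])
c = 0.1
print(np.allclose(relu(c * z), c * relu(z)))  # True: same piecewise-linear shape
```

So the “pushes the activation into its linear region” story doesn’t seem to apply to ReLU, which is what prompted my question.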

I obviously miss something. Thanks.

I think this is “over-reading” or (worst case) misinterpreting what Prof. Ng said. Actually it would be good for me to watch it again. Do you have a reference to the point at which he says that? Which lecture, and at what time offset?

It sounds to me like he must have meant this as just one of the reasons that regularization can work. What I remember him saying about why having smaller weights in general (which is the basic mechanism of L2 regularization) is that it dampens the effect of the model putting too much emphasis on some specific features. It’s not clear to me how the “linear region” interpretation for tanh/sigmoid has anything to do with that.
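For concreteness, here is a minimal sketch of that mechanism (my own illustration, with made-up values for the regularization strength `lam`, batch size `m`, and learning rate `lr`): the L2 term adds `(lam/m) * W` to the gradient, so every update pulls the weights toward zero regardless of which activation follows.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))    # hypothetical weight matrix of one layer
dW = np.zeros_like(W)          # pretend the data gradient is zero, to isolate the decay effect
lam, m, lr = 0.7, 10, 0.1      # assumed regularization strength, batch size, learning rate

# Gradient step with the extra L2 ("weight decay") term:
W_new = W - lr * (dW + (lam / m) * W)

print(np.linalg.norm(W_new) < np.linalg.norm(W))  # True: the weights shrink
```

The point is that this shrinking happens at the level of the weights themselves, which is why it discourages over-reliance on any particular feature, independently of the tanh/sigmoid linear-region story.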

Also note that there are lots of different types of regularization, and they don’t all work in the same way. E.g. Dropout achieves its effect in a completely different way than L2 regularization does.
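As a quick illustration of that contrast (a sketch of the “inverted dropout” idea covered in this course, with assumed values for `keep_prob` and the layer shape): instead of shrinking weights, Dropout randomly zeroes activations during training and rescales the survivors.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.uniform(size=(5, 4))    # hypothetical activations of one layer
keep_prob = 0.8

# Inverted dropout: zero out each unit with probability 1 - keep_prob,
# then scale survivors by 1/keep_prob so the expected activation is unchanged.
mask = rng.uniform(size=a.shape) < keep_prob
a_dropped = (a * mask) / keep_prob
```

So the network can’t rely on any single unit being present, which is a very different mechanism from penalizing weight magnitudes.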


@paulinpaloalto Thanks for answering. I am referring to the video named
“Why Regularization Reduces Overfitting?” in the first week of the second course. Listen from 3:30 to 4:30.

(I do understand that other regularization techniques work differently)



Good question, @G11. I was wondering too. Did you find an answer to it?

Hi Usman_Abbas,

No, I was hoping that @paulinpaloalto would watch the part of the video that I referenced. Let’s hope we will get an answer soon 🙂

Here’s another recent thread on initialization that I think provides intuitions that are also relevant to this question. In both that case and the regularization case, we are reasoning about the effect of the magnitudes of the coefficients, so I think the reasoning there applies here as well.