In the first week, Prof. Ng explains why regularization works by showing that if the cost function “forces” the weights to be small, that also gives small z (z = W*a), which in turn keeps g(z) in the roughly linear region if g(z) = tanh(z). I.e. we get a more linear model, and I guess the same reasoning applies to the sigmoid. So far so good.
What I do not understand is why regularization makes a difference if I use ReLU. I mean, ReLU is either linear or 0, regardless of whether z is “big” or “small”, positive or negative.
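To make the premise of my question concrete, here is a quick numpy sketch (my own, not from the lecture): tanh is approximately the identity for small z but saturates for large z, while ReLU has the same piecewise-linear shape at every scale.

```python
import numpy as np

z_small = np.array([-0.1, -0.01, 0.01, 0.1])
z_large = np.array([-3.0, -1.5, 1.5, 3.0])

print("tanh near 0:", np.tanh(z_small))    # almost identical to z_small (linear region)
print("tanh far out:", np.tanh(z_large))   # saturates toward -1 / +1 (non-linear region)

relu = lambda z: np.maximum(0, z)
print("ReLU small:", relu(z_small))        # identity for positives, 0 for negatives
print("ReLU large:", relu(z_large))        # same shape, just scaled; no saturation
```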
I think this is “over-reading” or (worst case) misinterpreting what Prof. Ng said. Actually, it would be good for me to watch it again. Do you have a reference to the point at which he said that? Which lecture, and at what time offset?
It sounds to me like he must have meant this as just one of the reasons that regularization can work. What I remember him saying about why smaller weights in general help (which is the basic mechanism of L2 regularization) is that they dampen the effect of the model putting too much emphasis on some specific features. It’s not clear to me how the “linear region of tanh/sigmoid” interpretation has anything to do with that.
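To be concrete about the mechanism I mean, here is a minimal numpy sketch of how the L2 (Frobenius norm) penalty enters the cost and the update; the names and numbers are just illustrative, not taken from the assignment:

```python
import numpy as np

# Rough sketch (lambd, shapes, and values are placeholders): the L2 term adds the
# scaled sum of squared Frobenius norms of the weight matrices to the cost, and
# (lambd / m) * W to each gradient, so every weight shrinks a little on every
# update ("weight decay").
def l2_penalty(weights, lambd, m):
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

W1 = np.random.randn(4, 3)
W2 = np.random.randn(1, 4)
print(l2_penalty([W1, W2], lambd=0.7, m=100))

# In gradient descent the corresponding update is effectively:
#   W = W - learning_rate * (dW_from_backprop + (lambd / m) * W)
```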
Also note that there are lots of different types of regularization, and they don’t all work in the same way. E.g. dropout achieves its effect in a completely different way from L2 regularization.
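For contrast, here is a minimal sketch of inverted dropout in the spirit of how the course presents it (keep_prob and the activation shape are just illustrative): instead of penalizing weight magnitudes, it randomly zeroes activations during training and rescales the survivors.

```python
import numpy as np

def inverted_dropout(A, keep_prob):
    # Zero out each activation with probability (1 - keep_prob), then scale the
    # survivors by 1 / keep_prob so the expected value of A is unchanged.
    D = (np.random.rand(*A.shape) < keep_prob).astype(float)
    return (A * D) / keep_prob

A = np.random.randn(5, 4)   # activations for one layer, 4 examples (illustrative shape)
print(inverted_dropout(A, keep_prob=0.8))
```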
@paulinpaloalto Thanks for answering. I am referring to the video named
“Why Regularization Reduces Overfitting?” in the first week of the second course. Listen from 3:30 to 4:30.
(I do understand that other regularization techniques work differently)
Here’s another recent thread on initialization that I think provides intuitions relevant to this question as well. In both that case and the regularization case, we are reasoning about the effect of the magnitudes of the weights, so the same logic applies here.