Week 1 - why regularization works with ReLU

G11 · December 21, 2021, 1:47pm

Hi,

In the first week, Prof. Ng explains why regularization works by showing that if the cost function “forces” the weights to be small, z (z = W*a) also stays small, which in turn keeps g(z) in its linear region when g(z) = tanh(z). In other words, we get a more linear model, and I guess the same reasoning applies to the sigmoid. So far so good.
What I do not understand is why regularization makes a difference if I use ReLU. I mean, ReLU is either linear (for positive z) or 0 (for negative z), regardless of whether |z| is “big” or “small”.

I am obviously missing something. Thanks.
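
To make the observation in this question concrete, here is a minimal NumPy sketch (not from the lecture; the helper `relu` and the sample ranges are chosen only for illustration). It checks that tanh(z) is close to z only when |z| is small, whereas ReLU has the same shape at every scale, i.e. relu(c*z) equals c*relu(z) for any c > 0.

```python
import numpy as np

def relu(z):
    """Elementwise ReLU: max(0, z)."""
    return np.maximum(0.0, z)

# Illustrative pre-activation ranges (assumed values, not from the course).
z_small = np.linspace(-0.1, 0.1, 5)
z_large = np.linspace(-3.0, 3.0, 5)

# tanh is approximately the identity near 0, but saturates for large |z|.
tanh_err_small = np.max(np.abs(np.tanh(z_small) - z_small))
tanh_err_large = np.max(np.abs(np.tanh(z_large) - z_large))

# ReLU is positively homogeneous: scaling z by c > 0 just scales the output,
# so shrinking z does not move ReLU into a "more linear" regime.
c = 0.01
relu_scale_gap = np.max(np.abs(relu(c * z_large) - c * relu(z_large)))

print(tanh_err_small, tanh_err_large, relu_scale_gap)
```

This only illustrates why the “linear region” argument does not carry over directly to ReLU; it does not by itself explain why regularization still helps with ReLU networks.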

paulinpaloalto · December 21, 2021, 11:33pm

I think this is “over reading” or (worst case) misinterpreting what Prof Ng said. Actually it would be good for me to watch it again. Do you have a reference to the point at which he said that? Which lecture and the time offset?

It sounds to me like he must have meant this as just one of the reasons that regularization can work. What I remember him saying is that keeping the weights smaller in general (which is the basic mechanism of L2 regularization) dampens the model’s tendency to put too much emphasis on some specific features. It’s not clear to me how the “linear region” interpretation of tanh/sigmoid has anything to do with that.
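
As a rough sketch of the “smaller weights” mechanism mentioned above, the following assumes the usual L2 formulation, in which a penalty (lambd / (2*m)) * sum of squared weights is added to the cost, so each gradient gets an extra (lambd / m) * W term. The function names `l2_penalty` and `l2_update` and the numbers are hypothetical, picked for illustration.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """L2 term added to the cost: (lambd / (2m)) * sum of squared weights."""
    return (lambd / (2.0 * m)) * sum(np.sum(W ** 2) for W in weights)

def l2_update(W, dW_data, lambd, m, learning_rate):
    """One gradient step where the L2 term contributes (lambd / m) * W."""
    dW = dW_data + (lambd / m) * W
    return W - learning_rate * dW

# Illustrative values (assumptions, not from the course).
W = np.array([[2.0, -4.0]])
zero_grad = np.zeros_like(W)  # pretend the data gradient is zero

penalty = l2_penalty([W], lambd=10.0, m=100)
W_next = l2_update(W, zero_grad, lambd=10.0, m=100, learning_rate=0.5)

print(penalty, W_next)
```

With a zero data gradient, the update multiplies W by (1 - learning_rate * lambd / m), which is why this is often called weight decay: every step pulls all weights toward zero, whatever the activation function is.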

Also note that there are lots of different types of regularization and they don’t all work in the same way. E.g. Dropout achieves its effect in a completely different way than L2 regularization.
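
For contrast with the L2 penalty, here is a minimal sketch of inverted dropout applied to one layer’s activations. The function name `inverted_dropout`, the `keep_prob` value, and the array shape are assumptions for illustration only. Note that it never touches the weights directly: it randomly zeroes activations and rescales the survivors.

```python
import numpy as np

def inverted_dropout(a, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob, then rescale
    the kept ones by 1 / keep_prob so the expected value is unchanged."""
    mask = rng.random(a.shape) < keep_prob
    return (a * mask) / keep_prob, mask

rng = np.random.default_rng(0)
a = np.ones((4, 1000))  # stand-in activations (assumed shape)
a_drop, mask = inverted_dropout(a, keep_prob=0.8, rng=rng)

# Each entry is now either 0 (dropped) or 1 / 0.8 = 1.25 (kept and rescaled).
print(a_drop.mean(), mask.mean())
```

The mechanism is randomness during training rather than a penalty on weight magnitude, which is the sense in which the two techniques work differently.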

G11 · December 22, 2021, 8:21am

@paulinpaloalto Thanks for answering. I am referring to the video named “Why Regularization Reduces Overfitting?” in the first week of the second course. Listen from 3:30 to 4:30.

(I do understand that other regularization techniques work differently.)

Thanks

/G

Usman_Abbas · January 7, 2022, 9:14pm

Good question, @G11. I was wondering about this too. Did you find an answer?

G11 · January 14, 2022, 8:07am

Hi Usman_Abbas,

No, I was hoping that @paulinpaloalto would look at the part of the lecture that I pointed to in my reply. Let’s hope we get an answer soon.

paulinpaloalto · January 14, 2022, 3:51pm

Here’s another recent thread on initialization that I think provides intuitions that are also relevant to this question. In both that case and the regularization case, we are reasoning about the effect of the magnitudes of the coefficients, so I think the reasoning there applies here as well.
