In neural networks, we often use activation functions like tanh, sigmoid, and ReLU. Andrew Ng’s lecture highlighted how regularization pushes weights towards zero, and for the tanh function this means the output gets closer to its linear region near zero. This, in turn, makes the learned function smoother, which avoids overfitting. So, does this mean that if we use only ReLU activation functions for the entire network, given the linear behaviour of the function when z > 0, the whole model behaves more like a linear model?
Well, there are a number of different regularization algorithms. L2 regularization has the general effect of pushing weights to be smaller in absolute value, but other algorithms like dropout do not have that direct effect. Perhaps in some cases dropout would end up having that effect, but it gets there by a different route.
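To make the L2 effect concrete, here is a minimal sketch (my own toy setup, not from the course): gradient descent on a simple quadratic loss, with and without an L2 penalty. The extra `lam * w` term in the gradient shrinks each weight toward zero on every step, which is why L2 is also called "weight decay".

```python
import numpy as np

def step(w, grad_loss, lam=0.0, lr=0.1):
    # One gradient-descent step; the L2 penalty adds lam * w to the gradient.
    return w - lr * (grad_loss(w) + lam * w)

# Toy loss: L(w) = 0.5 * ||w - target||^2, so grad_loss(w) = w - target.
target = np.array([3.0, -2.0])
grad_loss = lambda w: w - target

w_plain, w_l2 = np.zeros(2), np.zeros(2)
for _ in range(1000):
    w_plain = step(w_plain, grad_loss, lam=0.0)  # no regularization
    w_l2 = step(w_l2, grad_loss, lam=0.5)        # with L2 penalty

# Without the penalty, w converges to the target; with it, w is pulled
# toward zero: the minimizer becomes target / (1 + lam).
```

The closed-form minimizer with the penalty is target / (1 + lam), so every weight ends up strictly smaller in absolute value, just as described above.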
Also note that ReLU is still a non-linear function, even though it is linear for z > 0. When you compose non-linear functions (e.g. the layers of a NN) the non-linearity compounds. So even if you use ReLU as the activation on all the layers other than the output layer, the intent is not to make it a linear model. If a linear model could solve the problem, we wouldn’t need a neural network, right? Logistic Regression would suffice.
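A quick numerical check of that point (my own toy example, not from the lecture): a linear function f must satisfy f(a + b) = f(a) + f(b), and ReLU fails that test. And even a tiny two-layer ReLU net with hand-picked weights computes |x|, which no single linear map w*x + b can represent.

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

# ReLU is not linear: additivity fails at the kink.
assert relu(1.0) + relu(-1.0) == 1.0  # f(1) + f(-1) = 1
assert relu(1.0 + -1.0) == 0.0        # f(0) = 0, so f(a+b) != f(a) + f(b)

def tiny_net(x):
    # Hidden layer: ReLU(W1 x) with W1 = [[1], [-1]]
    h = relu(np.array([x, -x]))
    # Output layer: [1, 1] . h, which equals |x|
    return h.sum()
```

So stacking ReLU layers gives you piecewise-linear functions with many "kinks", which is far more expressive than one global linear model.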
In general, what regularization does is suppress the specificity of the model to some degree (that degree, of course, is dependent on the hyperparameters you choose: \lambda in the L2 case or “keep prob” in the dropout case) in the hopes that will make it more applicable to the general input data it needs to handle, as opposed to the specific training data set that is being used.
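Since "keep prob" came up, here is a short sketch of inverted dropout, the standard formulation covered in the course (the function and variable names here are mine): each hidden unit is zeroed with probability 1 - keep_prob during training, and the survivors are scaled by 1/keep_prob so the expected activation is unchanged. That is the mechanism by which dropout stops the model from relying on any one specific unit.

```python
import numpy as np

def dropout_forward(a, keep_prob=0.8, rng=None):
    # Inverted dropout: zero out units with prob (1 - keep_prob),
    # then rescale so the expected value of each activation is preserved.
    rng = rng or np.random.default_rng(0)
    mask = rng.random(a.shape) < keep_prob  # True with probability keep_prob
    return a * mask / keep_prob

a = np.ones((4, 5))
a_drop = dropout_forward(a, keep_prob=0.8)
# Surviving entries become 1 / 0.8 = 1.25; dropped entries are 0.
```

At test time you simply skip this step entirely; the rescaling during training is what makes that valid.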
You don’t mention which course (if any) you are taking here and you created this thread in a generic “AI Discussions” category. The place in the DLAI courses that sounds the most directly applicable to the questions you are asking is DLS Course 2 Week 1. In the lectures there, Professor Andrew discusses all these points in some detail. You can listen to the lectures in “audit” mode on Coursera and it won’t cost any money. It would be worth taking a look at the lectures in DLS C2 W1 on handling overfitting to hear what he has to say. You can probably find them on YouTube for that matter, if you prefer that route.
I think an immediate question to ask ourselves is: while regularization pushes the pre-activations into the linear region of tanh by making the weights small, does regularization push them into the linear region of ReLU by any means? Does regularization push z towards z > 0 (without touching z < 0) by any means?