In neural networks, we often use activation functions like tanh, sigmoid, and ReLU. Andrew Ng’s lecture highlighted how regularization pushes weights towards zero, and for the tanh function this means the output gets closer to its linear region near zero. This, in turn, makes the learned function smoother, which avoids overfitting. So, does this mean that if we use only ReLU activation functions for the entire network, given the linear behaviour of the function when z > 0, the whole model behaves more like a linear model?
Well, there are a number of different regularization algorithms. L2 regularization has the general effect of pushing weights to be smaller in absolute value, but other algorithms like dropout do not have that direct effect. Perhaps in some cases dropout would end up having that effect, but it gets there by a different route.
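To make the L2 effect concrete, here is a minimal sketch (my own toy setup, not from the course): gradient descent on a simple quadratic loss, with and without an L2 penalty. The extra `lam * w` term in the gradient shrinks each weight toward zero on every step, which is why L2 is also called "weight decay".

```python
import numpy as np

def step(w, grad_loss, lam=0.0, lr=0.1):
    # One gradient-descent step; the L2 penalty adds lam * w to the gradient.
    return w - lr * (grad_loss(w) + lam * w)

# Toy loss: L(w) = 0.5 * ||w - target||^2, so grad_loss(w) = w - target.
target = np.array([3.0, -2.0])
grad_loss = lambda w: w - target

w_plain, w_l2 = np.zeros(2), np.zeros(2)
for _ in range(1000):
    w_plain = step(w_plain, grad_loss, lam=0.0)  # no regularization
    w_l2 = step(w_l2, grad_loss, lam=0.5)        # with L2 penalty

# Without the penalty, w converges to the target; with it, w is pulled
# toward zero: the minimizer becomes target / (1 + lam).
```

The closed-form minimizer with the penalty is target / (1 + lam), so every weight ends up strictly smaller in absolute value, just as described above.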
Also note that ReLU is still a non-linear function, even though it is linear for z > 0. When you compose non-linear functions (e.g. the layers of a NN) the non-linearity compounds. So even if you use ReLU as the activation on all the layers other than the output layer, the intent is not to make it a linear model. If a linear model could solve the problem, we wouldn’t need a neural network, right? Logistic Regression would suffice.
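A quick numerical check of that point (my own toy example, not from the lecture): a linear function f must satisfy f(a + b) = f(a) + f(b), and ReLU fails that test. And even a tiny two-layer ReLU net with hand-picked weights computes |x|, which no single linear map w*x + b can represent.

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

# ReLU is not linear: additivity fails at the kink.
assert relu(1.0) + relu(-1.0) == 1.0  # f(1) + f(-1) = 1
assert relu(1.0 + -1.0) == 0.0        # f(0) = 0, so f(a+b) != f(a) + f(b)

def tiny_net(x):
    # Hidden layer: ReLU(W1 x) with W1 = [[1], [-1]]
    h = relu(np.array([x, -x]))
    # Output layer: [1, 1] . h, which equals |x|
    return h.sum()
```

So stacking ReLU layers gives you piecewise-linear functions with many "kinks", which is far more expressive than one global linear model.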
In general, what regularization does is suppress the specificity of the model to some degree (that degree, of course, is dependent on the hyperparameters you choose: \lambda in the L2 case or “keep prob” in the dropout case) in the hopes that will make it more applicable to the general input data it needs to handle, as opposed to the specific training data set that is being used.
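Since "keep prob" came up, here is a short sketch of inverted dropout, the standard formulation covered in the course (the function and variable names here are mine): each hidden unit is zeroed with probability 1 - keep_prob during training, and the survivors are scaled by 1/keep_prob so the expected activation is unchanged. That is the mechanism by which dropout stops the model from relying on any one specific unit.

```python
import numpy as np

def dropout_forward(a, keep_prob=0.8, rng=None):
    # Inverted dropout: zero out units with prob (1 - keep_prob),
    # then rescale so the expected value of each activation is preserved.
    rng = rng or np.random.default_rng(0)
    mask = rng.random(a.shape) < keep_prob  # True with probability keep_prob
    return a * mask / keep_prob

a = np.ones((4, 5))
a_drop = dropout_forward(a, keep_prob=0.8)
# Surviving entries become 1 / 0.8 = 1.25; dropped entries are 0.
```

At test time you simply skip this step entirely; the rescaling during training is what makes that valid.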
You don’t mention which course (if any) you are taking here and you created this thread in a generic “AI Discussions” category. The place in the DLAI courses that sounds the most directly applicable to the questions you are asking is DLS Course 2 Week 1. In the lectures there, Professor Andrew discusses all these points in some detail. You can listen to the lectures in “audit” mode on Coursera and it won’t cost any money. It would be worth taking a look at the lectures in DLS C2 W1 on handling overfitting to hear what he has to say. You can probably find them on YouTube for that matter, if you prefer that route.
I think an immediate question to ask ourselves is: while regularization pushes the pre-activations into the linear region of tanh by making the weights small, does regularization push them into the linear region of ReLU by any means? Does regularization push z towards z > 0 (without touching z < 0) by any means?