I’m revising the week 1 lecture “Why Regularization Reduces Overfitting?”, where Andrew explains the intuition for the tanh activation function. He says:
if the regularization becomes very large, the parameters W become very small, so Z will be relatively small,
so the activation function will be relatively linear,
and so your whole neural network will be computing something not too far from a big linear function, which is therefore a pretty simple function rather than a very complex, highly non-linear function
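To see that intuition numerically, here is a minimal sketch (assuming NumPy; the layer sizes and weight scales are made up for illustration) showing that as the weights shrink, the pre-activations z shrink and tanh(z) gets very close to the purely linear output z:

```python
# Minimal sketch: smaller weights -> smaller pre-activations -> tanh behaves
# almost like the identity, i.e. the layer is close to a linear map.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))              # one 4-dimensional input (illustrative)

for scale in (1.0, 0.1, 0.01):           # "stronger regularization" -> smaller W
    W = scale * rng.normal(size=(3, 4))  # weights of a single hidden layer
    z = W @ x                            # pre-activation
    a = np.tanh(z)                       # tanh activation
    # How far is tanh(z) from the purely linear output z?
    print(f"scale={scale:5.2f}  max|z|={np.max(np.abs(z)):.4f}  "
          f"max|tanh(z) - z|={np.max(np.abs(a - z)):.6f}")
```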
I wonder why we don’t get the same effect for the ReLU activation function. It’s linear for all z > 0, not only for small values of z.
ReLU is almost linear, but still non-linear enough. I think Section 3.1 (Rectifier Neurons) of “Deep Sparse Rectifier Neural Networks” (Glorot et al., 2011) gives a really good explanation of why it works:
… the only non-linearity in the network comes from the path selection associated with individual neurons being active or not. For a given input only a subset of neurons are active. Computation is linear on this subset: once this subset of neurons is selected, the output is a linear function of the input (although a large enough change can trigger a discrete change of the active set of neurons). The function computed by each neuron or by the network output in terms of the network input is thus linear by parts. We can see the model as an exponential number of linear models that share parameters (Nair and Hinton, 2010).
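Here is a minimal sketch of that “exponentially many linear models” view (assuming NumPy; the network sizes and random weights are made up for illustration). For a given input we record which hidden units are active, and check that the network output is exactly the affine function of the input obtained by keeping only those units:

```python
# Minimal sketch: a ReLU network is linear on the region where the set of
# active units stays fixed; different active sets give different linear pieces.
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=(5, 1))
W2, b2 = rng.normal(size=(1, 5)), rng.normal(size=(1, 1))

def relu_net(x):
    h = np.maximum(0.0, W1 @ x + b1)     # ReLU hidden layer
    return W2 @ h + b2

x = rng.normal(size=(3, 1))
mask = (W1 @ x + b1 > 0).astype(float)   # which hidden units are active for this x

# On this active set the network is exactly an affine function of the input:
W_eff = W2 @ (mask * W1)                 # effective weights for this linear piece
b_eff = W2 @ (mask * b1) + b2            # effective bias for this linear piece

print(np.allclose(relu_net(x), W_eff @ x + b_eff))                   # True
# A tiny perturbation usually keeps the same active set, so the same linear piece...
print(np.allclose(relu_net(x + 1e-6), W_eff @ (x + 1e-6) + b_eff))   # typically True
# ...while a large enough change flips some units on/off and selects a different
# linear piece, which is where the network's non-linearity comes from.
```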