Course 2, Week 1: Regularization Doubt

In L2 regularization with a high lambda, the z values become small and fall in the roughly linear part of the tanh activation function. That near-linear activation reduces any complex NN to something close to a linear model, i.e., a much less expressive one. Why isn’t this logic extended to ReLU? It is linear for all positive z values, so how come L2 regularization is still effective there?
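
For context, the L2-regularized cost from the lectures has (roughly) the form

$$J = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{L}\lVert W^{[l]}\rVert_F^2$$

so a large \lambda pushes the weights W^{[l]}, and with them the pre-activations z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}, toward small magnitudes.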

There are several things to consider here:

  1. It may well be that “high” values of \lambda don’t give the best solution in any given case; the \lambda value needs to be tuned to the specific problem you have. Of course, what counts as high or low is relative anyway.

  2. Suppressing the magnitudes of the weights using L2 regularization may have other effects besides controlling where in the domain of your activation function the values end up. Maybe in some cases L2 works by decreasing the relative influence of specific features or inputs on the result (the weight-decay form of the update, written out after this list, hints at how that can happen).

  3. tanh is non-linear: there is no part of its domain over which the graph is a straight line, so reducing the magnitudes of the z values does not eliminate the non-linearity. In the ReLU case, there is nothing about L2 that makes it more likely that the z values will be positive, right? You’re just reducing their magnitudes in general, so ReLU still zeroes out the negative z values and provides non-linearity (see the quick numerical check after this list).
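
To make point 2 concrete, one way to see it is via the weight-decay form of the gradient step, writing \partial J_{data}/\partial W^{[l]} for the gradient of the unregularized part of the cost and \alpha for the learning rate:

$$W^{[l]} := W^{[l]} - \alpha\left(\frac{\partial J_{data}}{\partial W^{[l]}} + \frac{\lambda}{m}W^{[l]}\right) = \left(1 - \frac{\alpha\lambda}{m}\right)W^{[l]} - \alpha\,\frac{\partial J_{data}}{\partial W^{[l]}}$$

Every weight is multiplied by the same decay factor on every step, so the weights that the data gradient doesn’t keep pushing up are the ones that shrink toward zero, which is one way L2 can reduce the influence of particular inputs or features.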
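And as a quick numerical check on point 3, here is a small numpy sketch. The layer sizes and the 10x shrink factor are just invented numbers to mimic what heavy L2 tends to do to W, and biases are left out for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented sizes: 20 input features, 10 hidden units, 1000 examples.
X = rng.standard_normal((20, 1000))
W = rng.standard_normal((10, 20))

z = W @ X                  # pre-activations with the original weights
z_shrunk = (0.1 * W) @ X   # same weights shrunk 10x, roughly what heavy L2 pushes towards

# ReLU: shrinking the weights rescales z but does not change its sign,
# so ReLU still zeroes out exactly the same units and the non-linearity survives.
print((z > 0).mean(), (z_shrunk > 0).mean())   # identical fractions

# tanh: even for the shrunken (small) z values, tanh(z) is not the straight line y = z.
print(np.abs(np.tanh(z_shrunk) - z_shrunk).max())   # strictly greater than 0
```

Shrinking W only rescales z, so the positive/negative split that ReLU acts on is unchanged, and tanh of the smaller z values still deviates from a straight line.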
