Question on how regularization helps by making networks closer to linear

In Week 2, Andrew mentioned that one way Frobenius-norm regularization helps is that it makes the weights w small, which keeps z close to zero. This makes the layer closer to linear when the activation function is tanh.
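For reference, a minimal sketch of the penalty being discussed: the cost gets an extra term proportional to the sum of squared entries of every weight matrix, and the corresponding gradient term shrinks the weights on each update. The shapes, lambda, and m below are made-up illustration values, not anything from the course assignments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weight matrices for a small 2-layer network (shapes are assumptions)
W = [rng.standard_normal((4, 3)), rng.standard_normal((1, 4))]

lambd = 0.7   # regularization strength lambda (value is an assumption)
m = 100       # number of training examples (assumption)

# Frobenius-norm penalty added to the cost:
#   (lambda / (2m)) * sum over layers l of the squared entries of W[l]
penalty = (lambd / (2 * m)) * sum(np.sum(Wl ** 2) for Wl in W)

# Its gradient contributes (lambda/m) * W[l] to dW[l], so every gradient
# step also shrinks the weights multiplicatively ("weight decay"):
learning_rate = 0.1
W_decayed = [Wl * (1 - learning_rate * lambd / m) for Wl in W]
```

Smaller weights mean smaller z = Wa + b, which is what pushes tanh units toward their near-linear region.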

My question is that we don’t apply Frobenius regularization only to layers with tanh activation functions. When we use sigmoid, wouldn’t small values result in a non-linear layer? Wouldn’t that have the opposite effect of what we want?

The shapes of tanh and sigmoid are pretty similar. In fact, you can show that tanh is just a scaled and shifted sigmoid: tanh(z) = 2 * sigmoid(2z) - 1. They both have quasi-linear regions in the center of their graphs (for input values near 0), so pushing z toward zero does not make a sigmoid layer any less linear than a tanh layer.

But I don’t think the point of regularization is necessarily to get us into the relatively linear regions of the various activation functions. The point of suppressing the values of the weights is that it prevents any one input (at any given layer) from having an outsized influence on the results. It “regularizes” things by “evening out” the influences of the various neurons. I believe that this is the intuition for why regularization prevents overfitting: it keeps any one input at a given layer from dominating the results through a particularly large coefficient (its corresponding weight).
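A quick numerical check of the two claims above: the scaled-and-shifted identity holds exactly, and near z = 0 both activations are approximately linear (tanh(z) is roughly z, and sigmoid(z) is roughly 0.5 + z/4). This is just an illustrative sketch using NumPy.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3.0, 3.0, 201)

# tanh is a scaled and shifted sigmoid: tanh(z) = 2*sigmoid(2z) - 1
assert np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1)

# Near z = 0 both have quasi-linear regions (first-order Taylor expansions):
small = np.linspace(-0.1, 0.1, 21)
tanh_err = np.max(np.abs(np.tanh(small) - small))            # tanh(z) ~ z
sigm_err = np.max(np.abs(sigmoid(small) - (0.5 + small / 4)))  # sigmoid(z) ~ 0.5 + z/4
```

Both error terms come out tiny on this interval, which is the sense in which small z keeps either activation in a nearly linear regime.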
