In the video giving a justification for why regularization works, Andrew Ng repeatedly says “ignoring b for the moment” when suggesting that keeping the weights small will lead to behavior in the linear regime of the tanh function. But that is only true if b is small, and since we aren’t regularizing b, there’s no reason for it to be small. In that case, the justification falls apart. Is there a justification that takes into account the existence of sizeable b offsets? Or do we just say, “well, regularization works well in practice and there isn’t a good intuition about why”?

I think it’s safe to conclude from the experimental evidence that Prof. Ng’s method works, otherwise he wouldn’t be telling us about it. All of this is “prior art”, of course.

But there must be some intuition available here as well. What’s going on in a Neural Network is pretty complicated, but if you look at any one neuron in any given layer, the weights and bias values define an “affine transformation”, right?

`z = w · a + b`

So think about the geometric interpretation of that: in the simplest case of Logistic Regression, we take that transformation and set it equal to 0 to get the “decision boundary” between yes and no answers. Well, what we really do is set `sigmoid(z) = 0.5`, but that’s the same thing. That equation gives us a hyperplane in the input space: the w vector defines the normal vector to the hyperplane, and the bias term b determines its distance from the origin. So maybe the intuition is that the normal vector is more critical to controlling the behavior of the transformation than the distance from the origin. Or that, in terms of getting the optimal solution, it suffices to constrain only the normal vector and let the bias react naturally. Or that the normal vector is clearly a bigger deal because it has n degrees of freedom, whereas (considering only a single neuron) the bias has only one.
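To make that geometry concrete, here’s a minimal sketch (with assumed example values for w and b) showing that w gives the unit normal of the boundary hyperplane and |b|/‖w‖ gives its distance from the origin:

```python
import numpy as np

# Hypothetical 2-D example of a single logistic unit's decision boundary:
# z = w . a + b, with the boundary being the hyperplane where z = 0.
w = np.array([3.0, 4.0])   # weight vector (assumed values, for illustration)
b = -5.0                   # bias (assumed value)

norm_w = np.linalg.norm(w)        # here 5.0
unit_normal = w / norm_w          # the direction the boundary faces: [0.6, 0.8]
distance = abs(b) / norm_w        # distance from the origin to the hyperplane: 1.0

# The boundary point closest to the origin lies along the normal direction:
closest = (-b / norm_w**2) * w
print(np.dot(w, closest) + b)     # 0.0 -> the point is on the boundary
```

Note that scaling w changes the normal’s length but not the boundary’s orientation; b alone shifts the plane toward or away from the origin.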

Of course (as mentioned above), this is just the bare bones simplest case and what’s really happening is a different affine transformation on every neuron in every layer composed with non-linear activations at every layer. So who knows if that intuition really generalizes. But apparently the experimental results justify the decision …

Thanks for the reply. It seemed that Prof. Ng’s justification centered on the idea that keeping z close to zero keeps it in the linear regime of the tanh function, where the gradient information will be largest. He more or less says exactly that. He then says that in order to keep z close to zero, we want to keep w small, “ignoring b for the moment”.

You offer intuitions about why the normal vector (i.e. w) is perhaps more important than b in general. Thanks, those are good arguments. Given that w is more important, there is still the question of why regularization of w is effective even when b is not small. If b is not small, the idea that small w produces small z (and therefore a linear tanh input regime) doesn’t work.

I have the greatest respect for Dr. Ng and I’m certainly not questioning the validity of anything he says. I’m certain regularization is a bedrock principle and very well established. I’m just looking to understand a little more deeply beyond “ignoring b for the moment”.

It’s been a while since I listened to those lectures, so maybe I really should go back and watch again. But my guess is that you are conflating different cases here. Why would keeping z in the linear region of tanh have anything to do with overfitting, which is what regularization is addressing? That would be relevant for avoiding vanishing gradients, but that’s a completely different point, right?

I think at some point in the regularization section he says that the intuition for why suppressing the magnitude of the weights in general is useful for reducing overfitting is that it prevents particular individual features from having impact on the results that is in some way “out of proportion” to their real importance. Notice that the bias values are not associated with particular inputs, right? It’s only the weights that have that property. That’s the intuition that stuck with me for L2 regularization.

Thanks, here’s a brief excerpt from the transcript with the relevant discussion:

From “Why Regularization Reduces Overfitting”

" So just to summarize, if the regularization parameters are very large, the parameters W very small, so z will be relatively small, kind of ignoring the effects of b for now, but so z is relatively, so z will be relatively small, or really, I should say it takes on a small range of values. And so the activation function if it’s tanh, say, will be relatively linear. And so your whole neural network will be computing something not too far from a big linear function, which is therefore, pretty simple function, rather than a very complex highly non-linear function. And so, is also much less able to overfit, ok?"

I guess I was just hung up on the “ignoring b for now”, and hoping there was something more that addressed that. There doesn’t seem to be, so I have a different intuition. “Weight decay” is proportional to the size of the weight, so the overall effect is to reduce the variation in the size of the weights. This helps keep the causal influence of the preceding layer from being concentrated in just a few weights, maintaining the “effective size” of the net. This will be true regardless of b. Also, even without “ignoring b”, it is still the case that large weights will often (not always) push z into the nonlinear regime. So the intuition Prof. Ng offers may still apply in a statistical sense, even considering b.
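That “statistical sense” is easy to check numerically. Here is a rough sketch (with assumed input distribution, layer width, and bias value): for the same deliberately non-small b, smaller weights keep z in a narrower band around b, where tanh stays much closer to its tangent line, i.e. the unit behaves more linearly:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((10000, 20))   # simulated activations from the previous layer
b = 1.5                                 # deliberately non-small bias (assumed value)

for scale in (2.0, 0.2):                # "large" vs regularized ("small") weights
    w = scale * rng.standard_normal(20)
    z = a @ w + b
    # How far tanh(z) strays from the tangent line at z = b measures how much
    # of the nonlinear regime the unit actually uses.
    tangent = np.tanh(b) + (1 - np.tanh(b) ** 2) * (z - b)
    max_dev = np.max(np.abs(np.tanh(z) - tangent))
    print(f"scale={scale}: std(z)={np.std(z):.2f}, max deviation={max_dev:.3f}")
```

The regularized case shows a much smaller maximum deviation from linearity, even though b itself is unchanged, which is consistent with the intuition that shrinking w alone narrows the slice of tanh the network uses.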

I appreciate the dialogue here; it helps me develop my own thinking about what would at first seem to be a very basic topic. (Sometimes “very basic” also means “taken for granted”).

Hello @Doug_Cutrell,

I think it is interesting to inspect this slide again:

Andrew said that when w is small and b = 0, we are in the linear range, which more or less simplifies our NN to a linear regression. Then we can ask ourselves: if b is very large, where will we be?

We will be in the plateau of the tanh, which makes the layer output a constant. If staying in the linear range somehow makes our NN like a linear regression, then I think staying in the plateau is an even stronger simplification.

Then I think the challenge becomes: what if b is neither too large nor too small? I think one takeaway from this video is that, in the spirit of countering overfitting, we don’t want our NN to fully leverage the whole range of tanh. By reducing w, we achieve that purpose. And if b is neither too large nor too small, then the NN will be able to access more of the (non-linear) range of tanh. One thing is for sure, though: it can only leverage half of it, because for b > 0 it has more access to the upper half of tanh but not the lower half.

Of course, to verify this answer, we probably want to do an experiment like this: train 2 NNs without regularization and without bias, one using tanh and one using a modified tanh, and compare their performance. The modified tanh equals tanh when x > 0, but is linear when x < 0.
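For the experiment, the modified activation is simple to write down. A minimal sketch (the name `half_tanh` is my own, hypothetical):

```python
import numpy as np

def half_tanh(x):
    """Hypothetical 'modified tanh' for the experiment above:
    the usual tanh for x > 0, but linear (the identity) for x <= 0."""
    return np.where(x > 0, np.tanh(x), x)

# Sanity check: identity on the negative side, tanh on the positive side.
print(half_tanh(np.array([-2.0, 0.0, 2.0])))  # -2.0, 0.0, and tanh(2) ~ 0.964
```

One caveat for a real training run: this function saturates on the positive side only, so a gradient-based comparison against plain tanh would also need matched initialization to be fair.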

Cheers,

Raymond