Questions about regularization

Hi all. I had a question regarding regularization.

In chapter 2.3 of Week 1 (Regularizing your Neural Network - Why regularization reduces overfitting), Andrew Ng provides the following intuitive explanation:

By reducing the absolute values that the weight parameters (w) can reach, regularization keeps the pre-activation values small, so sigmoid or tanh type activation functions operate mostly within their roughly linear zone around zero.

My question is: does regularization also work for ReLU type activations, given that ReLU is piecewise linear? And if so, why does it work?

My second question: is there a good source where I can dig deeper into the mathematics of how and why regularization works?

Many thanks!

Keeping the absolute values of the weights at all layers “suppressed” is a good thing, because even if we use ReLU in the hidden layers, the output layer of a classification network will still be sigmoid (or softmax in the multiclass case), which means we still have to worry about the “flat tails” of the function. When the absolute values of the Z values at the output layer get too large, the gradients approach zero. The weight values at all layers contribute to that.
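
If it helps to see that numerically, here is a quick numpy sketch (my own, not from the course) of both effects: the sigmoid gradient collapsing in the flat tails as |z| grows, and tanh being roughly linear for small z:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient of sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z))
for z in [0.5, 2.0, 5.0, 10.0]:
    s = sigmoid(z)
    print(f"z = {z:5.1f}   sigmoid'(z) = {s * (1 - s):.6f}")
# As |z| grows, the gradient collapses toward zero ("flat tails").

# Near z = 0, tanh is approximately linear: tanh(z) ~ z
for z in [0.1, 0.3, 0.5]:
    print(f"z = {z:.1f}   tanh(z) = {np.tanh(z):.4f}")
```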

In terms of going deeper into this mathematically, I have not personally tried to do that so I don’t have any direct references that I can give. Here’s a general bibliography thread about textbooks about ML/DL. I’ve heard that the Goodfellow, Bengio et al book is more mathematical. I just checked the ToC and they definitely have a chapter about Regularization.

Thanks for the answer and the bibliography thread.

Suppressing the weights before a layer using a sigmoid or softmax function makes sense to me.
But what if the output is a numeric value (e.g. predicting house prices) rather than a classification?

thanks!

Paul would have more insight on this, but my first response would be: if house prices are your goal, then perhaps an NN is not the best model to use? Also, throughout DLS, other than where we reproduce logistic regression ‘with an NN mindset’, I can’t think of a case where we output a strictly continuous value.

I mean, I suppose you could do it: you just make the buckets for your classification layer infinitely small, which means having a huge dense layer on your back end.

Perhaps someone else has a better suggestion as to how to do it.

*I think you can, but you’d choose your final layer not to be sigmoid, but actually ‘linear’ in shape. The big question here, though, is whether you are in fact performing better than a straight regression.

If you have a regression problem where the output prediction is a continuous number like a house price, stock price, temperature and so forth, then you would use either ReLU or just a linear output (no activation) at the output layer. We don’t really see examples of that type of network in DLS, though. The only one I can think of where that is a factor is YOLO (DLS C4 W3), where some of the outputs are classifications (object type) and some of them are regressions (bounding boxes around the objects). But that is a much more complex case and they don’t go deep enough for us to really see how regularization would be applied in that kind of network.
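
Just to make that concrete, here is a minimal Keras sketch (my own, not something from DLS) of what a regression network with ReLU hidden layers, a linear output, and L2 weight penalties might look like; the layer sizes and the lambda value are placeholders you would have to tune:

```python
import tensorflow as tf

# Hypothetical regression network (e.g. house prices): ReLU hidden layers,
# a single linear output unit, and L2 penalties on the hidden-layer weights.
l2 = tf.keras.regularizers.l2(0.01)  # lambda is a hyperparameter to tune

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', kernel_regularizer=l2),
    tf.keras.layers.Dense(32, activation='relu', kernel_regularizer=l2),
    tf.keras.layers.Dense(1, activation='linear'),  # no squashing at the output
])

model.compile(optimizer='adam', loss='mse')
```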

I don’t have experience trying to train regression models of this type, so I haven’t seen how one would deal with overfitting in a case like that. A couple of thoughts though:

  1. Of course the cost function would be totally different if you are predicting a continuous number instead of a “yes/no” or multiclass classification. Typically you would use a distance-style loss function like MSE, although it’s not clear whether that has any effect on regularization (see the sketch after this list).
  2. There are other regularization techniques like Dropout which work differently than the “weight suppression” style that you get from L2 regularization.
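
To illustrate point 1, here is a sketch (my own, just following the DLS convention for the L2 term) of an MSE cost carrying the same kind of L2 penalty used in the classification costs; the penalty is simply added on top of whatever data loss you choose:

```python
import numpy as np

def mse_cost_with_l2(y_hat, y, weights, lambd):
    """MSE data loss plus the usual L2 penalty:
       J = (1/m) * sum((y_hat - y)^2) + (lambd / (2*m)) * sum_l ||W_l||_F^2
    """
    m = y.shape[0]
    mse = np.mean((y_hat - y) ** 2)
    l2_penalty = (lambd / (2 * m)) * sum(np.sum(W ** 2) for W in weights)
    return mse + l2_penalty
```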

Maybe it’s worth a little google searching. It’s an interesting question and I’d have to believe it’s been considered by experts. I tried “how to handle overfitting when training a neural net for regression” and other than the ChatGPT response at the top, I got this article on the Kaggle site. It would also be worth taking a look at the ToC of the Goodfellow book. I’ll bet they cover regression models as well.

I had to dig back a bit as well into my Harvard ML studies-- regularization is also used in traditional (non-NN) settings:

https://rafalab.dfci.harvard.edu/dsbook-part-2/highdim/regularization.html

Though nowhere therein are neural nets invoked at all.

Would using them give you an advantage? I’m not sure-- one also has to consider the compute time and size of the model overall.

Yet it is always fun to experiment, so if you find something out let us know :grin:

*actually, it has been a little while since I looked into this, but I recall now-- ‘ridge regression’ should suffice.
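
For what it’s worth, ridge regression is exactly the L2 idea applied to plain linear regression. A minimal scikit-learn sketch with made-up toy data (the alpha value, which plays the role of lambda, is just a placeholder):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy data: 100 "houses", 5 numeric features, continuous price-like target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=100)

# Ridge regression = ordinary least squares + an L2 penalty on the coefficients
model = Ridge(alpha=1.0)
model.fit(X, y)
print(model.coef_)  # large coefficients get shrunk toward zero
```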


After a little more thought, there is another intuition about how L2 regularization works, in addition to the one that initiated this conversation about suppressing the absolute values of the elements of Z at the output layer to avoid vanishing gradient problems from the flat tails of sigmoid. The other intuition that I remember Prof Ng discussing is that L2 regularization preferentially suppresses large weight values, which has the effect of moderating the influence of specific individual input values on the results at any given layer of the network. In other words, L2 moderates the ability of any single input to dominate the output, which is another way to mitigate overfitting. That effect is relevant at all layers, not just the output layer, and should be applicable in either a regression application or a classification application. So even though the “flat tails” problem does not apply in the regression case, there is the potential for L2 to be useful there as well.
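
One way to see that “preferential suppression” numerically: the L2 term contributes (lambda / m) * W to the gradient, so each weight gets pulled back in proportion to its own size. A toy sketch with made-up numbers:

```python
import numpy as np

# The L2 term adds (lambda / m) * W to each weight's gradient, so the update
#   W := W - alpha * (dW + (lambda / m) * W)
# shrinks every weight in proportion to its own size ("weight decay").
W = np.array([0.1, 1.0, 10.0])       # small, medium, and large weights
alpha, lambd, m = 0.1, 5.0, 100
shrinkage = alpha * (lambd / m) * W  # extra pull toward zero from the L2 term
print(shrinkage)                     # the largest weight is pulled back the hardest
```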

Of course Anthony’s point about experimentation applies in any case like this. We always need to fiddle with the relevant hyperparameters (\lambda in the L2 case) to get the effect we want, and there is no a priori guarantee that it will be a sufficient solution.