Hi all. I had a question regarding regularization.
In chapter 2.3 of Week 1 (Regularizing your Neural Network - Why regularization reduces overfitting), Andrew Ng provides the following intuitive explanation:
By reducing the absolute values that the weight parameters (w) can reach, regularization keeps sigmoid or tanh type activation functions operating within their roughly linear zone.
My question is: does regularization also work for ReLU type activations, given that they are piecewise linear? And if so, why does it work?
My second question: is there a good source where I can dig deeper into the mathematics of how and why regularization works?
Keeping the absolute values of the weights at all layers "suppressed" is a good thing, because even if we use ReLU in the hidden layers, the output layer will be sigmoid (or softmax in the multiclass case), which means we still have to worry about the "flat tails" of the function. When the absolute values of the Z values at the output layer get too large, the gradients approach zero. The values at all layers contribute to that.
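A minimal numpy sketch (my own, not from the course notebooks) makes the "flat tails" point concrete: the sigmoid gradient collapses as |z| grows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # derivative of sigmoid: s(z) * (1 - s(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# As |z| grows, the gradient at the output layer vanishes,
# which is why keeping the weights (and hence Z) suppressed helps training.
for z in [0.0, 2.0, 5.0, 10.0, 20.0]:
    print(f"z = {z:5.1f}  sigmoid'(z) = {sigmoid_grad(z):.2e}")
```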
In terms of going deeper into this mathematically, I have not personally tried to do that, so I don't have any direct references that I can give. Here's a general bibliography thread about textbooks about ML/DL. I've heard that the Goodfellow, Bengio et al book is more mathematical. I just checked the ToC and they definitely have a chapter about Regularization.
Thanks for the answer and the bibliography thread.
Suppressing the weights before a layer using a sigmoid or softmax function makes sense to me.
But what if the output is numeric (e.g. predicting house prices) rather than a classification?
Paul would have more insight on this, but my first response would be: if house prices are your goal, then perhaps NNs are not the best model to use? Throughout the DLS, other than where we reproduce logistic regression "with an NN mindset", I can't think of a case where we output a strictly continuous value.
I mean, I suppose you could do it: you just make the buckets for your classification layer infinitely small, which means having a huge dense layer on the back end.
Perhaps someone else has a better suggestion as to how to do it.
I think you can, but you'd choose your final layer not to be sigmoid, but actually "linear" in shape. The big question here, though, is: are you in fact performing better than doing a straight regression?
If you have a regression problem where the output prediction is a continuous number like a house price, stock price, temperature and so forth, then you would either use ReLU or just the linear output with no activation at the output layer. We don't really see examples of that type of network in DLS, though. The only one I can think of where that is a factor is YOLO (DLS C4 W3), where some of the outputs are classifications (object type) and some of them are regressions (bounding boxes around the objects). But that is a much more complex case and they don't really go deep enough for us to really see how regularization would be applied in that kind of case.
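As an illustration (a hypothetical sketch, not something from the DLS notebooks), a Keras regression model for something like house prices just ends in a single unit with the default linear activation and a distance-style loss:

```python
import tensorflow as tf

# Hypothetical regression model: n_features input values -> one continuous prediction.
n_features = 10

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_features,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    # Output layer: a single unit with linear activation (the default),
    # so the network can produce any real-valued prediction.
    tf.keras.layers.Dense(1),
])

# Distance-style loss (mean squared error) instead of cross-entropy.
model.compile(optimizer="adam", loss="mse")
```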
I don't have experience trying to train regression models of this type, so I haven't seen how one would deal with overfitting in a case like that. A couple of thoughts though:
Of course the cost function would be totally different if you are predicting a continuous number, instead of a "yes/no" or multiclass classification. Typically you would use a distance-style loss function like MSE, although it's not clear whether that has any effect on regularization.
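For reference, the usual MSE cost over m training examples is

$$J = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$$

(some texts include an extra factor of \frac{1}{2} for cleaner gradients), and the L2 penalty term can be added on top of it exactly as in the classification case.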
There are other regularization techniques like Dropout which work differently than the "weight suppression" style that you get from L2 regularization.
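As a hypothetical Keras sketch (again, not course code), both styles can be attached to the hidden layers of a regression model, independent of what the output layer looks like:

```python
import tensorflow as tf

lambd = 0.01      # L2 regularization strength (a hyperparameter to tune)
keep_prob = 0.8   # Dropout keeps each unit with probability 0.8

regularized_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    # L2 "weight suppression" applied to this layer's weight matrix.
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(lambd)),
    # Dropout randomly zeroes activations during training instead of shrinking weights.
    tf.keras.layers.Dropout(rate=1 - keep_prob),
    tf.keras.layers.Dense(32, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(lambd)),
    tf.keras.layers.Dense(1),  # linear output for the regression case
])

regularized_model.compile(optimizer="adam", loss="mse")
```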
Maybe it's worth a little google searching. It's an interesting question and I'd have to believe it's been considered by experts. I tried "how to handle overfitting when training a neural net for regression" and, other than the ChatGPT response at the top, I got this article on the Kaggle site. It would also be worth taking a look at the ToC of the Goodfellow book. I'll bet they cover regression models as well.
After a little more thought, there is another intuition about how L2 regularization works, in addition to the one that initiated this conversation about suppressing the absolute values of the elements of Z at the output layer to avoid vanishing gradient problems from the flat tails of sigmoid. The other intuition I remember Prof Ng discussing is that L2 regularization preferentially suppresses large weight values, which has the effect of moderating the influence of specific individual input values on the results at any given layer of the network. In other words, L2 moderates the ability of single inputs to affect the output, which is another way to mitigate overfitting. That effect is relevant at all layers, not just the output layer, and should be applicable in either a regression application or a classification application. So even though the "flat tails" problem does not apply in the regression case, there is still the potential for L2 to be useful there.
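To see why L2 suppresses large weights preferentially, it helps to write out the regularized cost and the resulting gradient descent step in the course's notation for layer l (the standard "weight decay" algebra):

$$J_{reg} = J + \frac{\lambda}{2m}\sum_{l}\|W^{[l]}\|_F^2, \qquad \frac{\partial J_{reg}}{\partial W^{[l]}} = dW^{[l]} + \frac{\lambda}{m}W^{[l]}$$

$$W^{[l]} := W^{[l]} - \alpha\left(dW^{[l]} + \frac{\lambda}{m}W^{[l]}\right) = \left(1 - \frac{\alpha\lambda}{m}\right)W^{[l]} - \alpha\, dW^{[l]}$$

The extra decay term is proportional to W^{[l]} itself, so larger weights get pulled toward zero harder than small ones, at every layer, whether the output is a classification or a regression.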
Of course Anthony's point about experimentation applies in any case like this. We always need to fiddle with the relevant hyperparameters (\lambda in the L2 case) to get the effect we want, and there is no a priori guarantee that it will be a sufficient solution.