Regularization: Intuition and Conservation of influence

rmwkwok · June 26, 2022, 4:29am

I deleted yesterday’s reply because I wanted to add some experiment results…

We can visualize it.

Consider a simple linear model with only one weight, f(x) = wx, trained with the squared loss,

our cost function is: J_0 = \frac{1}{2m}\sum_{i=1}^{m}(wx^{(i)}-y^{(i)})^2

and regularized cost is: J = J_0 + \frac{\lambda}{2m}w^2

One important observation here is that J_0, the regularization term (\frac{\lambda}{2m}w^2) and J are all 2nd degree polynomials of w, so they are all parabolas. (For comparison, a 1st degree polynomial is a line.)

Now we plot all three, J_0, \frac{\lambda}{2m}w^2 and J, on one graph to see how the regularization affects the original cost J_0:

[Figure: J_0, \frac{\lambda}{2m}w^2 and J plotted against w]
Green is original cost J_0
Blue is regularization \frac{\lambda}{2m}w^2
Red is regularized cost J = J_0 + \frac{\lambda}{2m}w^2

The optimal w for the original and the regularized cost are at the minimum points of the green and the red curves respectively, so from the graph, the optimal w becomes smaller in magnitude after regularization. The same idea applies to a linear model with more than one weight. The key takeaway here is that adding regularization pushes the optimal w closer to 0.
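To check the graph numerically, here is a minimal sketch for the one-weight model. Setting dJ/dw = 0 for the regularized cost above gives w = \sum x^{(i)}y^{(i)} / (\sum (x^{(i)})^2 + \lambda). The data and the helper name `optimal_w` are made up for illustration; they are not from the plot.

```python
import numpy as np

def optimal_w(x, y, lam):
    """Minimizer of J(w) = (1/2m) * sum((w*x - y)^2) + (lam/2m) * w^2."""
    # Setting dJ/dw = 0 gives w * (sum(x^2) + lam) = sum(x*y).
    return np.sum(x * y) / (np.sum(x ** 2) + lam)

# Synthetic data (an assumption for this sketch): y is roughly 3x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 3.0 * x + rng.normal(scale=0.5, size=50)

w_plain = optimal_w(x, y, lam=0.0)   # minimum of the original cost J_0
w_reg = optimal_w(x, y, lam=10.0)    # minimum of the regularized cost J
print(w_plain, w_reg)
```

Since lambda only adds to the denominator, the regularized minimizer keeps the same sign but always has a smaller magnitude, which is the shift from the green minimum to the red minimum.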

The answer is no: regularization generally doesn’t conserve the relative magnitudes of the weights, except when all features have the same variance and are uncorrelated with each other, which is rare in real-world data.
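A small sketch of why the exception holds, under some assumptions of mine: no bias term, synthetic data, and "same variance and uncorrelated" modelled as X^T X being a multiple of the identity. With more than one weight, the regularized cost is minimized by w = (X^T X + \lambda I)^{-1} X^T y. The helper name `ridge_weights` is just for illustration.

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Minimizer of (1/2m)*||Xw - y||^2 + (lam/2m)*||w||^2 (no bias term)."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(0)
m, lam = 200, 100.0
true_w = np.array([3.0, -2.0, 1.0])
noise = rng.normal(scale=0.1, size=m)

# Case 1: columns are orthogonal with equal scale, so X.T @ X = m * I.
Q, _ = np.linalg.qr(rng.normal(size=(m, 3)))
X_equal = Q * np.sqrt(m)
y_equal = X_equal @ true_w + noise
ratio_equal = ridge_weights(X_equal, y_equal, lam) / ridge_weights(X_equal, y_equal, 0.0)

# Case 2: same columns rescaled, so the features have different variances.
X_unequal = X_equal * np.array([1.0, 3.0, 0.3])
y_unequal = X_unequal @ true_w + noise
ratio_unequal = ridge_weights(X_unequal, y_unequal, lam) / ridge_weights(X_unequal, y_unequal, 0.0)

print(ratio_equal)    # every weight shrinks by the same factor m / (m + lam)
print(ratio_unequal)  # each weight shrinks by a different factor
```

In the first case all weights are scaled by the same factor, so their relative magnitudes are conserved. In the second case the low-variance feature is shrunk much more than the high-variance one, so the ratios between weights change.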

However, if you only care about the relative ordering of the weights (instead of their relative magnitudes), then depending on the regularization parameter lambda and on the variances of and correlations between features, you might see that some (or most) orderings are preserved. I did an experiment with a dataset of 100 features, and plotted histograms of the ranking shift over different choices of lambda and of the number of features involved.

When you have only 2 features, the ordering of the features is not changed. However, as we increase the number of features and/or the size of lambda, the histogram spreads out, which means there are more and more ranking shifts, and larger ones.
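If you want to try something similar yourself, here is a rough sketch. It is not the code or the dataset behind the histograms described above: the data is synthetic, and I am assuming "ranking shift" means the change in a weight's rank by |w| between the unregularized and the regularized fit. The helper names are hypothetical.

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Minimizer of (1/2m)*||Xw - y||^2 + (lam/2m)*||w||^2 (no bias term)."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def rank_by_magnitude(w):
    """Rank of each weight by |w|, where rank 0 is the largest magnitude."""
    order = np.argsort(-np.abs(w))
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(w))
    return ranks

def ranking_shifts(X, y, lam):
    """Per-feature change in rank when going from lambda = 0 to lambda = lam."""
    return rank_by_magnitude(ridge_weights(X, y, lam)) - rank_by_magnitude(ridge_weights(X, y, 0.0))

# Synthetic features with different variances and correlations (assumption).
rng = np.random.default_rng(0)
m, n = 500, 20
X = rng.normal(size=(m, n)) @ rng.normal(size=(n, n))
y = X @ rng.normal(size=n) + rng.normal(size=m)

for lam in (0.0, 10.0, 1000.0):
    shifts = ranking_shifts(X, y, lam)
    values, counts = np.unique(shifts, return_counts=True)
    # Counts per shift value: the raw numbers behind a histogram.
    print(lam, dict(zip(values.tolist(), counts.tolist())))
```

A shift of 0 means the feature kept its position. Changing `n` and the `lam` values lets you see how the spread of the counts behaves on your own data.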

Cheers!
