Hello, how can we intuitively predict the values of the parameters w_j after regularization? We know that some features have more impact on the y-value the model predicts; how does regularization preserve the influence/importance of those features relative to less important features, given our training set?
Thanks in advance.
Hi there,
I think the most intuitive way is to take a look at your loss function and the components it consists of. Ask yourself how large lambda is, i.e. how "important" regularization is in comparison with your performance goal.
The resulting weights are then the output of your optimization problem on this very loss function. Independent of this, feature ranking might be a nice tool for you if you are interested in evaluating the importance of features: Permutation Importance vs Random Forest Feature Importance (MDI) — scikit-learn 1.6.1 documentation
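If you want to try that out, here is a minimal sketch using scikit-learn's permutation_importance (the dataset and model below are placeholders, just to show the call pattern):

```python
# Minimal permutation-importance sketch; the dataset and Ridge model
# are placeholders for illustration, not tied to the course labs.
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = Ridge(alpha=1.0).fit(X_train, y_train)

# Shuffle one feature at a time and measure how much the validation score drops.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
for i, importance in enumerate(result.importances_mean):
    print(f"feature {i}: importance = {importance:.3f}")
```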
This thread might be worth a look, too:
Best
Christian
Hello @renesultan,
I deleted yesterday’s reply because I wanted to add some experiment results…
We can visualize it.
Consider a simple linear model with only one weight f(x) = wx and using the squared loss,
our cost function is: J_0 = \frac{1}{2m}\sum_{i=1}^{m}(wx^{(i)}-y^{(i)})^2
and regularized cost is: J = J_0 + \frac{\lambda}{2m}w^2
One important observation here is that J_0, the regularization term \frac{\lambda}{2m}w^2, and J are all 2nd-degree polynomials of w, so they are parabolas. (A 1st-degree polynomial, by contrast, is a line.)
Now we plot all of them, J_0, \frac{\lambda}{2m}w^2 and J, on a graph to see how regularization affects the original cost J_0:
Green is original cost J_0
Blue is regularization \frac{\lambda}{2m}w^2
Red is regularized cost J = J_0 + \frac{\lambda}{2m}w^2
The optimal w for the original and regularized costs are at the minimum points of the green and red curves respectively, so from the graph, the optimal w becomes smaller after regularization. The same idea applies to a linear model with more than one weight. The key take-away here is that adding regularization pushes the optimal w closer to 0.
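For reference, a short matplotlib sketch along these lines (the data points and lambda below are made up just for the plot) could look like this:

```python
# Toy illustration of how the regularization term shifts the cost minimum;
# the data points and lambda are made-up values for the plot only.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0])   # toy inputs
y = np.array([2.0, 4.1, 5.9])   # toy targets, roughly y = 2x
m, lam = len(x), 20.0           # number of samples and regularization strength

w = np.linspace(-1, 4, 200)
J0 = np.array([np.sum((wi * x - y) ** 2) / (2 * m) for wi in w])  # original cost
reg = lam / (2 * m) * w ** 2                                      # regularization term
J = J0 + reg                                                      # regularized cost

plt.plot(w, J0, "g", label="$J_0$ (original cost)")
plt.plot(w, reg, "b", label=r"$\frac{\lambda}{2m}w^2$ (regularization)")
plt.plot(w, J, "r", label="$J$ (regularized cost)")
plt.xlabel("w")
plt.ylabel("cost")
plt.legend()
plt.show()
```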
The answer is no, regularization generally doesn’t preserve that, except when all features have the same variance and are all uncorrelated with each other, which is rare in real-world data.
However, if you only care about the relative ordering (instead of the relative magnitude of the weights), then depending on the regularization parameter lambda and the variances and correlations between features, you might see that some (or most) orderings are preserved. I did an experiment with a dataset of 100 features, and plotted a histogram of the ranking shifts over different choices of lambda and numbers of features involved.
When you have only 2 features, the ordering of the features is not changed. However, as we increase the number of features and/or the size of lambda, the histogram spreads out, which means there are more ranking shifts, and larger ones.
Cheers!
Thank you so much for your answers, it makes much more sense now!
Hi @rmwkwok, could you please explain how "relative ordering" and "ranking shift" are defined in your experiment? I do not understand how the histogram supports the statement "depending on the regularization parameter lambda and the variances and correlations between features, you might see that some (or most) orderings are preserved". Thank you very much!
Hello @liyu, thanks for the questions!
relative ordering
Initial ranks: fit the optimal weights without regularization, and rank them.
ranking shift: fit the optimal weights with regularization, then rank them again, and get the shift by subtracting the new rank from the initial rank for each weight. Each weight has a shift (zero, positive or negative), and I plot the distribution of the shifts.
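A rough sketch of that kind of experiment (not the exact code behind the histogram above; the dataset, lambda, and the use of |weight| for ranking are illustrative assumptions) could look like this:

```python
# Rough sketch of the ranking-shift experiment: rank the weights fitted
# without regularization, re-rank them after ridge regularization, and
# plot the per-weight rank difference. Data and lambda are illustrative.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import rankdata
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=500, n_features=100, noise=5.0, random_state=0)

w_plain = LinearRegression().fit(X, y).coef_   # optimal weights, no regularization
w_reg = Ridge(alpha=100.0).fit(X, y).coef_     # optimal weights with regularization

initial_rank = rankdata(np.abs(w_plain))       # initial ranks
new_rank = rankdata(np.abs(w_reg))             # ranks after regularization
shift = initial_rank - new_rank                # ranking shift per weight

plt.hist(shift, bins=30)
plt.xlabel("ranking shift")
plt.ylabel("number of weights")
plt.show()
```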
dependence on regularization parameter, variances of features and correlations
One way to see this exactly is to analyze the problem with some maths.
Let’s consider the case of y=w_1x_1 + w_2x_2, with regularization \lambda, and with the means of the features equal to zeros.
We consider the closed-form solution. The combination of linear regression and squared loss that we have learnt in this course has an analytical way for us to find the optimal w in one step. Having said that, there is a greater reason for us to learn the numerical way, gradient descent, because it applies to every problem: almost no neural network you will come across has a closed-form solution. Also, calculating the closed-form solution is computationally expensive as the size of X grows and can be infeasible when X is just too large; gradient descent, however, doesn’t have this problem.
Closed form solution for linear regression under squared loss:
\vec{w} = (X^TX + \lambda I)^{-1}X^T\vec{y}
where X and \vec{y} are our dataset; X is an m \times n matrix with m samples and n features (the convention used in our course), and \vec{w} = (w_1, w_2) is the weight vector we need to solve for.
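If you want to check this numerically, here is a quick NumPy sketch of the closed-form solution on toy data (the data and lambda are just for illustration):

```python
# Closed-form ridge solution w = (X^T X + lambda * I)^(-1) X^T y on toy data;
# the dataset and lambda below are illustrative, not from the course labs.
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 2
X = rng.normal(size=(m, n))                                # m samples, n features
y = X @ np.array([3.0, 0.5]) + rng.normal(scale=0.1, size=m)

lam = 5.0
w_reg = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
w_unreg = np.linalg.solve(X.T @ X, X.T @ y)                # lambda = 0 case

print("unregularized weights:", w_unreg)
print("regularized weights:  ", w_reg)
```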
After working the maths out for a few minutes, we find that the ratio of the two regularized weights is
R=\frac{w_1^r}{w_2^r} = \frac{w_1 - \lambda k k_1}{w_2 - \lambda k k_2}
Note that w_1^r is distinguished from w_1 by the r in the superscript, denoting that the former is regularized. If \lambda=0, the ratio falls back to the ratio of the unregularized weights.
Here, the products \lambda kk_1 and \lambda kk_2 control whether the ratio increases or decreases, and that in turn affects the ranking. If the ratio goes from R<1 to R>1 or vice versa, we see a ranking shift.
k is composed of the variances of features #1 (\sigma_1^2), feature #2 (\sigma_2^2), and the label y (\sigma_y^2).
k_1 and k_2 are composed of the variances again, and also the correlations between feature #1 and #2 (c_{12}), between feature #1 and the label (c_{1y}) and between feature #2 and the label (c_{2y}).
That’s why I said the ranking shift depends on the regularization parameter (\lambda), the variances of the features, and the correlations. The maths for more than 2 weights is more complicated: as new weights join in, their variances and correlations also enter the products \lambda k k_1 and \lambda k k_2.
FYI:
k = \sqrt{\frac{\sigma_y^2}{\sigma_1^2 \sigma_2^2}}
k_1 = \sqrt{\frac{1}{\sigma_2^2}}\frac{c_{1y}}{1-c_{12}^2}
k_2 = \sqrt{\frac{1}{\sigma_1^2}}\frac{c_{2y}}{1-c_{12}^2}
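If you prefer to see this dependence without doing the algebra, a quick numeric check like the one below (toy correlated features and an illustrative grid of lambda values) shows the ratio of the regularized weights drifting as \lambda grows:

```python
# Numeric check of how the regularized weight ratio w1/w2 drifts with lambda;
# the correlated toy features and the lambda grid are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
m = 1000
x1 = rng.normal(size=m)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=m)   # feature 2 correlated with feature 1
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 1.5 * x2 + rng.normal(scale=0.5, size=m)

for lam in [0.0, 10.0, 100.0, 1000.0]:
    w = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
    print(f"lambda={lam:7.1f}  w1={w[0]:.3f}  w2={w[1]:.3f}  ratio={w[0]/w[1]:.3f}")
```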
Hello @rmwkwok, a big thank you for the detailed explanation!
The histogram and the analysis for 2 features illustrate clearly how lambda and the feature correlations and variances make a difference (a shift) in the weights.
The regularization formula seems at first to penalize all weights in the same way, so it is hard to get the intuition for why it should help with the overfitting problem.
You are welcome @liyu. I am glad to hear the answer helped!
Raymond