How to choose the right value for the regularization parameter?

I’ve just finished course 1 in the specialization, and tried to implement a polynomial regressor with feature scaling and regularization.

I tried to fit the model to these input values:

    features = np.array([0.5, 3, 5, 12, 100])
    targets = np.array([-0.25, 6, 20, 132, 9900])

These values lie exactly on the curve x^2 - x.

Without regularization (regularization parameter = 0), I get this result:

Cost: 1.0057510780545684e-19
w: [-1.  1.]
b: -6.161826604511589e-10

This is good so far. But when I set the regularization term to 1, I get this result:

Cost: 79600.11697910336
w: [46.0003336   0.45965975]
b: -32.76440392902737

I’m okay with the cost going up; after all, we’re reducing overfitting, which should increase the value of the cost function.
But the weights? I would expect all the weights to get smaller. The weight of the x^2 term did get smaller, but the weight of the x term got much bigger.

Is this correct behavior? I suspect that since I’m doing feature scaling, the feature values are getting much smaller, which might have an impact.

If I set the regularization parameter to something much smaller like 0.001, I get this result:

Cost: 37.974732753837024
w: [1.59413222 0.97505558]
b: -11.740479905065058

I have a feeling that this is not correct, and that it’s a bug.

My code for gradient descent:

{Moderator’s Edit: Code Removed}

If this is not a bug, how is this correct behavior? Why is one weight getting much bigger while the other is getting smaller? Why don't both shrink together?

And is there a way to make sense of the weights and determine whether the current value of the regularization parameter is good or not? I mean, is there a threshold between underfitting and overfitting that I should keep in mind, or is it just a matter of experimentation?

Hey @OsamaAhmad,
Welcome to the community. First of all, I am assuming you are aware that the regularization parameter can be any non-negative value; it is not limited to the range [0, 1]. I just wanted to put that out there, because I don’t recall whether Prof. Andrew mentioned it explicitly.

Second, the best way to find the right value of the regularization parameter is to perform hyper-parameter tuning, i.e., trying various values and choosing the one that gives the best performance on data the model was not trained on (for example, a validation set).
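
A minimal sketch of such a tuning loop, assuming a simple hold-out validation split and a closed-form ridge fit in NumPy. The helper names `ridge_fit` and `pick_lambda`, the synthetic data, and the candidate grid are all illustrative assumptions, not something from the course:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge fit; the intercept is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def pick_lambda(X_train, y_train, X_val, y_val, candidates):
    """Return the candidate with the lowest validation MSE, plus all scores."""
    scores = {}
    for lam in candidates:
        w, b = ridge_fit(X_train, y_train, lam)
        scores[lam] = float(np.mean((X_val @ w + b - y_val) ** 2))
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical synthetic data: y = x^2 - x plus noise, degree-6 polynomial features.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=60)
y = x**2 - x + rng.normal(scale=0.5, size=x.size)
X = np.stack([x**k for k in range(1, 7)], axis=1)

# Simple hold-out split: first 40 points for training, the rest for validation.
X_train, y_train = X[:40], y[:40]
X_val, y_val = X[40:], y[40:]

candidates = [0.0, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
best_lam, scores = pick_lambda(X_train, y_train, X_val, y_val, candidates)
print(best_lam, scores[best_lam])
```

The candidates are spaced roughly by factors of ten because the useful range of the parameter is usually not known in advance; a finer grid around the best value can follow.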

As for any bugs in the code, it would be great if you can share your entire code (DM; for that click on my name and select “Message”), instead of just the functions. That way, we can run the code ourselves and debug it if any bug is present.

P.S. - It is against the community guidelines to share any code that might be related to assignments. So, I will be removing the code shortly, once @rmwkwok has reviewed it. If any mentor needs to take a look at your code, he/she will ask you to DM it.

Regards,
Elemento


Thanks @Elemento, I have finished reviewing it. I am writing a reply now.


Great try, @OsamaAhmad!

My answer in short: it’s possibly because your regularization parameter was not rescaled accordingly after the features were scaled.

I used your batch_gradient_descent to build a class for demo, and the code for my class is at the end of this post.

First, define X and y:

import numpy as np

# features and targets are the arrays from the question
X = np.stack([features, features**2], axis=1)
y = targets

For the purpose of comparison, here is the result of sklearn’s implementation.

from sklearn.linear_model import Ridge
reg = Ridge(alpha = 1.).fit(X, y)
reg.coef_, reg.intercept_
#RESULT
#(array([-0.78476702,  0.98451511]), -0.41302367274783336)

Below is my class using your batch_gradient_descent, without scaling the regularization factors (note that scale_alpha=False). In the result, the first weight became larger than 1.

fit(X, y, alpha=1., scale_alpha=False, learning_rate=.01).run(num_iter=30000)
#RESULT
#[4.85976846 0.46432472] -6.160283575650695

Here is the same class, but with the regularization factors scaled accordingly. Now the result is very similar to sklearn’s.

fit(X, y, alpha=1., scale_alpha=True, learning_rate=.01).run(num_iter=30000)
#RESULT
#[-0.78455149  0.98449913] -0.4134160982165085

class fit:
    def __init__(self, xs, ys, alpha, scale_alpha = False, learning_rate = .01):
        self.w = np.array([-3,2.]) #initial w
        self.b = 0. #initial b
        
        self.X_mean = xs.mean(axis=0)
        self.X_std = xs.std(axis=0)
        
        self.xs = (xs-self.X_mean)/self.X_std
        self.ys = ys #shape (m)
        
        # scaling regularization parameters
        self.regularization_parameter = alpha/self.X_std**2 if scale_alpha else alpha

    def run(self, num_iter):
        for it in range(num_iter):
            self.batch_gradient_descent()
        print((self.w / self.X_std), self.b - (self.w*self.X_mean/self.X_std).sum())

    def batch_gradient_descent(self):
        # your implementation #
        pass
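
As a cross-check that does not depend on gradient descent, the effect of scale_alpha can be reproduced in closed form. The sketch below is not the class above. It assumes the objective documented for sklearn's Ridge (sum of squared errors plus alpha times the squared weights, with the intercept unpenalized) and uses a hypothetical helper, `ridge_closed_form`, on the data from the question. The first two printed lines should agree with each other, while the third should not. Under these assumptions the third line should land close to the weights reported in the question for a regularization parameter of 1. The exact numbers depend on the data and on how the cost is normalized, so they need not reproduce the figures quoted in this post.

```python
import numpy as np

# Data from the question.
features = np.array([0.5, 3, 5, 12, 100])
targets = np.array([-0.25, 6, 20, 132, 9900])
X = np.stack([features, features**2], axis=1)
y = targets

def ridge_closed_form(X, y, penalty):
    """Minimize ||Xw + b - y||^2 + sum(penalty * w**2); b is not penalized.

    `penalty` may be a scalar or one value per feature.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    P = np.diag(np.broadcast_to(penalty, (X.shape[1],)).astype(float))
    w = np.linalg.solve(Xc.T @ Xc + P, Xc.T @ yc)
    return w, y_mean - x_mean @ w

alpha = 1.0
X_mean, X_std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - X_mean) / X_std

# Reference: ridge on the raw, unscaled features.
w_raw, b_raw = ridge_closed_form(X, y, alpha)

# Scaled features, penalty rescaled per feature (like scale_alpha=True).
w_s, b_s = ridge_closed_form(X_scaled, y, alpha / X_std**2)
w_true, b_true = w_s / X_std, b_s - (w_s * X_mean / X_std).sum()

# Scaled features, penalty left unchanged (like scale_alpha=False).
w_u, b_u = ridge_closed_form(X_scaled, y, alpha)
w_false, b_false = w_u / X_std, b_u - (w_u * X_mean / X_std).sum()

print(w_raw, b_raw)
print(w_true, b_true)
print(w_false, b_false)
```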

Answered in my above demo.

Besides @Elemento's answer, I would like to add that in course 2 week 3, Professor Andrew Ng will talk about how to choose the regularization parameter using a cross-validation dataset.

Cheers!


Sorry for posting the code.
I tried to make the question as compact as I can by posting only the relevant parts of the code.
Anyway, I already figured out why I’m getting this result, thanks to @rmwkwok's answer.
Thanks!

Thank you for the answer! This solves the problem.
I’m just a little confused about the way the regularization parameter is being scaled. Why are we dividing by the square of the standard deviation?

Sure! For example, let’s say our model assumption is y = wx + b, and before scaling, the cost function with L2 regularization is

J = \frac{1}{2m}\sum_{i=1}^{m}{(wx^{(i)}+b-y^{(i)})^2}+\frac{\lambda }{m}w^2

Now we want to rescale x by {x^{(i)}}' = \frac{x^{(i)} - x_{\text{mean}}}{x_{\text{std}}}. Working out the relationship between w and w', and between b and b', and substituting them into J gives

J = \frac{1}{2m}\sum_{i=1}^{m}{(w'{x^{(i)}}'+b'-y^{(i)})^2}+\frac{1}{m}\frac{\lambda}{x_{\text{std}}^2}{w'}^2

Now this J(w',b') is what we are optimizing, given the rescaled dataset x'.
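
To spell out the substitution step (a sketch of the algebra, using the same symbols as above): the two parameterizations must give the same prediction for every x, which fixes the relationship between the parameters.

wx + b = w'\frac{x - x_{\text{mean}}}{x_{\text{std}}} + b' = \frac{w'}{x_{\text{std}}}x + \left(b' - \frac{w' x_{\text{mean}}}{x_{\text{std}}}\right)

Matching the coefficient of x and the constant term gives

w = \frac{w'}{x_{\text{std}}}, \quad b = b' - \frac{w' x_{\text{mean}}}{x_{\text{std}}}

so the penalty term becomes

\frac{\lambda}{m}w^2 = \frac{\lambda}{m}\left(\frac{w'}{x_{\text{std}}}\right)^2 = \frac{1}{m}\frac{\lambda}{x_{\text{std}}^2}{w'}^2

These are the same conversions the demo class applies when it prints self.w / self.X_std and self.b - (self.w*self.X_mean/self.X_std).sum().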

Cheers!


Thank you for the answer!

J = \frac{1}{2m}\sum_{i=1}^{m}{(w'{x^{(i)}}'+b'-y^{(i)})^2}+\frac{\lambda }{m}(\frac{w'}{x_{\text{std}}^2})^2

Here, shouldn’t the last term be \frac{\lambda }{m}(\frac{w'}{x_{\text{std}}})^2 (in which x_{\text{std}} is not squared)?
As a result, it’s equal to \frac{\lambda }{m}(\frac{w'^2}{x_{\text{std}}^2}) = \frac{\lambda }{mx_{\text{std}}^2}{w'^2} = \frac{\lambda }{m}{w^2}

Thank you @OsamaAhmad ! It was my mistake. I have updated the formula in the original post, and yes I like your way of grouping the x_{\text{std}} with \lambda.
