Large value of lambda in Regularization

Muhammad-Kalim-Ullah · August 7, 2022, 7:19pm

If you increase the regularization parameter lambda (λ), how would it lead to weights approximately equal to 0? I’m confused here. Any explanation regarding this would be helpful.
Thanks

rmwkwok · August 7, 2022, 8:49pm

Because you can only decrease the sizes of the weights in order to decrease the regularization’s contribution to the cost.

TMosh · August 7, 2022, 8:59pm

The regularization term is based on the product of lambda and the sum of the squares of the weights.

So the larger the weight values are, the higher the cost is going to be, regardless of how well the hypothesis fits the data set.

Recall that we’re trying to find the weights that give the minimum cost.

This creates an incentive for the system to learn smaller weight values, since this will reduce the additional cost from the regularization term.

In the extreme case, if lambda is very large, this will force the system to learn very small weight values (approaching zero).
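As a rough demonstration of the point above, here is a minimal sketch. It assumes linear regression without a bias term and a cost of the form (1/2m)·(sum of squared errors) + (λ/2m)·(sum of squared weights). The data and the helper `ridge_weights` are made up for illustration; they do not come from the course material.

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Weights that minimize (1/2m)*||Xw - y||^2 + (lam/2m)*||w||^2 (no bias term)."""
    n = X.shape[1]
    # Setting the gradient to zero gives (X^T X + lam*I) w = X^T y.
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Hypothetical data set: 50 examples, 3 features, known "true" weights plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([4.0, -2.0, 1.0])
y = X @ true_w + 0.1 * rng.normal(size=50)

lambdas = [0.0, 1.0, 100.0, 10_000.0, 1_000_000.0]
norms = []
for lam in lambdas:
    w = ridge_weights(X, y, lam)
    norms.append(np.linalg.norm(w))
    print(f"lambda={lam:>10g}  w={np.round(w, 4)}  ||w||={norms[-1]:.6f}")
```

The printed size of the weight vector shrinks as lambda grows: with a huge lambda the regularization term dominates the cost, so the only way to keep the total cost low is to make the weights tiny, even though the fit to the data gets worse.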

Muhammad-Kalim-Ullah · August 8, 2022, 8:08am

Sorry, what I’m asking is how it actually reduces the cost, and how assigning a large value to lambda (λ) forces the system to learn very small weight values. Is there any demonstration, example or extra material that shows this?
Thanks

Muhammad-Kalim-Ullah · August 8, 2022, 8:13am

Hi @TMosh, can you please clarify the last line?

How does the system end up learning small weight values when lambda is large?
Thanks

rmwkwok · August 8, 2022, 8:25am

You may check this out for an explanation with a graph.

Muhammad-Kalim-Ullah · August 8, 2022, 1:28pm

Thanks, Raymond, the graph helps me understand it. What are the ideal values of lambda to use with regularization? Also, I tried many values for lambda, and it works fine except when I set it to more than 100000: then the weights blow up to very large values and the cost becomes “nan”. When it is <10000, it works fine. Any tips on this? Thanks!

rmwkwok · August 8, 2022, 1:33pm

Hi @Muhammad-Kalim-Ullah, it depends on the problem but generally, I am afraid you will need to try it out. I usually try a few typical values such as 0.0001, 0.001, 0.01, 0.1, 1 and so on to see if the model will overfit or not. As soon as it does not overfit, I won’t use any larger value.
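On the “nan” with a very large lambda: the code behind that run is not shown in this thread, so the following is only one plausible cause, assuming plain gradient descent with a fixed learning rate. Each step multiplies the weights by roughly (1 − α·λ/m) before subtracting the data gradient. If α·λ/m is larger than about 2, that factor has magnitude above 1, the weights overshoot and grow every step, and the cost eventually overflows. The data, the function `gradient_descent`, and the values of `alpha` below are hypothetical.

```python
import numpy as np

def gradient_descent(X, y, lam, alpha, steps=500):
    """Plain batch gradient descent on the regularized squared-error cost (no bias term)."""
    m, n = X.shape
    w = np.zeros(n)
    # Silence overflow warnings so the diverging run can finish and be inspected.
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(steps):
            # Gradient of (1/2m)*||Xw - y||^2 + (lam/2m)*||w||^2
            grad = (X.T @ (X @ w - y)) / m + (lam / m) * w
            w = w - alpha * grad
    return w

# Hypothetical data: 100 examples, 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=100)

w_moderate = gradient_descent(X, y, lam=1_000.0, alpha=0.01)             # alpha*lam/m = 0.1
w_huge = gradient_descent(X, y, lam=100_000.0, alpha=0.01)               # alpha*lam/m = 10
w_huge_small_step = gradient_descent(X, y, lam=100_000.0, alpha=0.0001)  # alpha*lam/m = 0.1

print("moderate lambda:           ", w_moderate)
print("huge lambda, same alpha:   ", w_huge)
print("huge lambda, smaller alpha:", w_huge_small_step)
```

If this is what is happening, lowering the learning rate when lambda is raised should bring back small, finite weights, as in the third run.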

Muhammad-Kalim-Ullah · August 8, 2022, 2:19pm

Thanks for your feedback, it really helps.

rmwkwok · August 8, 2022, 2:21pm

You are welcome @Muhammad-Kalim-Ullah

TMosh · August 8, 2022, 7:53pm

Generally I use a sequence of 1:3:10 values, because it approximates a log distribution and covers a large range in a few tries.
So I’d use a sequence like: 0.01, 0.03, 0.1, 0.3, 1.0 … and continue as low or high as seems useful.

The best lambda value tends not to be very precise, because the cost usually doesn’t have a sharp minimum as a function of lambda when you test it with a validation or test set.

The goal is to set a lambda value that minimizes the cost on a validation or test set.
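A minimal sketch of that search, assuming a linear model fitted in closed form and a simple train/validation split. The data and the helpers `fit_ridge` and `mse_cost` are hypothetical, for illustration only.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form fit of linear regression with L2 regularization (no bias term)."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def mse_cost(X, y, w):
    # Unregularized cost: lambda is used when fitting, not when scoring.
    return np.mean((X @ w - y) ** 2) / 2

# Hypothetical data: 10 features, only the first 3 matter, and few examples.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))
true_w = np.zeros(10)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + rng.normal(scale=1.0, size=60)
X_train, y_train = X[:30], y[:30]
X_val, y_val = X[30:], y[30:]

# The 1:3:10 sequence described above.
candidates = [0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0, 30.0, 100.0]
val_costs = [mse_cost(X_val, y_val, fit_ridge(X_train, y_train, lam))
             for lam in candidates]

for lam, cost in zip(candidates, val_costs):
    print(f"lambda={lam:>6g}  validation cost={cost:.4f}")

best_lam = candidates[int(np.argmin(val_costs))]
print("lambda with the lowest validation cost:", best_lam)
```

Neighbouring lambda values often give similar validation costs, which is why a coarse grid like this is usually enough.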

Daniel_Blossey · December 5, 2022, 4:14pm

Today I watched the video “Cost function with regularization” and how regularization actually works.
I don’t get it: what is the relation of increasing the regularization parameter λ to a very high number and decreasing the parameters w1,w2,…wn?

I know that I can avoid overfitting if I decrease the parameters w1,w2,…wn to get a less wiggly model.

If I increase the regularization parameter λ to a very high number, do I have to decrease the parameters w1,w2,…wn so that the regularization term is 0 or small, because we don’t want high costs?
Do I set w1,w2,…wn to a tiny number?
Do I set the regularization parameter λ to a high number?
Do I set the regularization parameter λ or do I set the parameters w1,w2,…wn first?
I don’t know what causes what?

rmwkwok · December 6, 2022, 2:46am

Hello Daniel @Daniel_Blossey,

If we want to deal with overfitting:

Yes.

We set λ, which is a hyper-parameter. The parameters w1, w2, … are tuned and optimized by an optimizer such as gradient descent, not by us.

Setting a large λ gives the optimizer more incentive to push w1, w2, … down.

Cheers,
Raymond

Daniel_Blossey · December 6, 2022, 8:26am

Dear Raymond @rmwkwok,

thank you very much for your answers!

Is it true that I can set λ to a large number which results in a higher regularization term and ultimately in higher costs if the algorithm does not decrease the parameters w1,w2…wn?
So, the algorithm or optimizer such as gradient descent can be motivated to always find the lowest possible costs by decreasing the parameters w1,w2…wn if I set λ to a large number?

Thanks again and greetings to Hong Kong!

rmwkwok · December 6, 2022, 8:48am

Hello Daniel,

The optimizer does one thing: tune the weights (w1, w2…) in the hope that the cost will go down. It does the same thing no matter what λ is.

The algorithm will try to decrease the weights, but I cannot guarantee if the resulting cost will be higher or lower compared to another training starting with a smaller λ.

Always, no matter what λ is.

Usually the process is like this: we start with a set of randomly initialized weights, then we have a choice to make, namely what the value of λ is. Then the optimizer tunes the weights to go down the hill, trying to find a place where the cost is smaller.

A larger λ essentially changes the landscape of the cost space: the cost is elevated everywhere except at w = 0, and the mountains are highest where w is large. The optimizer has no other option but to move towards smaller w to find a valley it can get to.
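To make the landscape picture concrete, here is a toy case with a single weight. It is an illustration under made-up assumptions (one weight, and a data cost whose minimum sits at some value a), not the course's cost function:

```latex
J_\lambda(w) = \tfrac{1}{2}(w - a)^2 + \tfrac{\lambda}{2} w^2
\qquad\Longrightarrow\qquad
\frac{dJ_\lambda}{dw} = (w - a) + \lambda w = 0
\qquad\Longrightarrow\qquad
w^* = \frac{a}{1 + \lambda}
```

With λ = 0 the valley is at w* = a. As λ grows, the term (λ/2)w² lifts the landscape more and more the further w is from 0, and the valley slides towards zero: w* = a/2 at λ = 1, a/101 at λ = 100, and so on.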

I have been to Munich twice, and I like the Alps.

Raymond
