If you increase the regularization parameter lambda (λ), how would it lead to weights approximately equal to 0? I’m confused here. Any explanation regarding this would be helpful.

Thanks

Because you can only decrease the sizes of the weights in order to decrease the regularization’s contribution to the cost.

The regularization term is based on the product of lambda and the sum of the squares of the weights.

So the larger the weight values are, the higher the cost is going to be, regardless of how well the hypothesis fits the data set.

Recall that we’re trying to find the weights that give the minimum cost.

This creates an incentive for the system to learn smaller weight values, since this will reduce the additional cost from the regularization term.

In the extreme case, if lambda is very large, this will force the system to learn very small weight values (approaching zero).
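To make this concrete, here is a minimal sketch. It assumes linear regression with a squared-error cost plus a (λ/2m)·Σw² penalty, and it solves for the minimum-cost weights directly instead of running gradient descent. The data, `true_w`, and the helper name `fit_regularized` are made up for illustration.

```python
import numpy as np

# Synthetic data: 100 examples, 5 features, known "true" weights.
rng = np.random.default_rng(0)
m, n = 100, 5
X = rng.normal(size=(m, n))
true_w = np.array([3.0, -2.0, 1.5, 0.5, -1.0])
y = X @ true_w + 0.1 * rng.normal(size=m)

def fit_regularized(X, y, lam):
    """Weights minimizing (1/2m)*||Xw - y||^2 + (lam/2m)*||w||^2.

    Setting the gradient to zero gives (X^T X + lam*I) w = X^T y.
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

lambdas = [0.0, 1.0, 100.0, 10_000.0, 1_000_000.0]
norms = [np.linalg.norm(fit_regularized(X, y, lam)) for lam in lambdas]

for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:>10.1f}  ||w||={norm:.6f}")
```

The printed size of the weight vector shrinks as lambda grows, and for the largest lambda it is very close to zero.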

Sorry, I’m asking how it actually reduces the cost, and how assigning a large value to lambda (λ) forces the system to learn very small weight values. Is there any demonstration, example, or extra material that shows this?

Thanks

Hi @TMosh, can you please clarify the last line?

How does it end up learning small weight values when lambda is large?

Thanks

Thanks, Raymond, the graph helps me understand it. What are the ideal values of lambda to use for regularization? Also, I tried a lot of values for lambda, and it works fine except when I set it to more than 100000: then the weights become very large and the cost becomes “nan”. With values below 10000 it works fine. Any tips on this? Thanks!

Hi @Muhammad-Kalim-Ullah, it depends on the problem but generally, I am afraid you will need to try it out. I usually try a few typical values such as 0.0001, 0.001, 0.01, 0.1, 1 and so on to see if the model will overfit or not. As soon as it does not overfit, I won’t use any larger value.
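On the “nan” cost at very large lambda: one possible explanation, assuming plain gradient descent with a fixed learning rate, is that the regularization part of each update multiplies the weights by roughly (1 − α·λ/m). Once α·λ/m is larger than 2, that factor has magnitude above 1, so every step overshoots and the weights grow until they overflow. The sketch below uses made-up data and a hypothetical `gradient_descent` helper, not the course's lab code.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 3
X = rng.normal(size=(m, n))
y = X @ np.array([1.0, -2.0, 0.5])

def gradient_descent(X, y, lam, alpha, steps=1000):
    """Batch gradient descent on (1/2m)*||Xw - y||^2 + (lam/2m)*||w||^2."""
    m, n = X.shape
    w = np.zeros(n)
    # Silence overflow warnings so the divergence shows up in the result.
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(steps):
            grad = X.T @ (X @ w - y) / m + (lam / m) * w
            w = w - alpha * grad
    return w

w_ok = gradient_descent(X, y, lam=1.0, alpha=0.01)       # alpha*lam/m = 0.0002
w_blown = gradient_descent(X, y, lam=1e5, alpha=0.01)    # alpha*lam/m = 20
w_fixed = gradient_descent(X, y, lam=1e5, alpha=1e-5)    # alpha*lam/m = 0.02

print("lam=1,   alpha=0.01 :", w_ok)
print("lam=1e5, alpha=0.01 :", w_blown)
print("lam=1e5, alpha=1e-5 :", w_fixed)
```

If this is what is happening in your case, lowering the learning rate when you raise lambda should make the “nan” go away, and the weights should then shrink toward zero as expected.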

Thanks for your feedback, it really helps.

Generally I use a sequence of 1:3:10 values, because it approximates a log distribution and covers a large range in a few tries.

So I’d use a sequence like: 0.01, 0.03, 0.1, 0.3, 1.0 … and continue as low or high as seems useful.

The best lambda value tends to not be very precise, because the cost on a validation or test set usually doesn't have a sharp minimum as lambda varies.

The goal is to set a lambda value that minimizes the cost on a validation or test set.
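A rough sketch of that search, assuming regularized linear regression and a simple train/validation split. The data is synthetic, and `lambda_grid`, `fit_regularized`, and `unregularized_cost` are hypothetical helper names.

```python
import numpy as np

def lambda_grid(low_exp=-2, high_exp=1):
    """Candidates in the 1:3:10 pattern, e.g. 0.01, 0.03, 0.1, 0.3, ..."""
    grid = []
    for e in range(low_exp, high_exp + 1):
        grid += [1 * 10.0**e, 3 * 10.0**e]
    return grid

def fit_regularized(X, y, lam):
    # Minimizer of (1/2m)*||Xw - y||^2 + (lam/2m)*||w||^2.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def unregularized_cost(X, y, w):
    # Compare models on the plain squared-error cost, without the penalty.
    return np.mean((X @ w - y) ** 2) / 2

# Synthetic data: only the first two of 20 features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=60)
X_train, y_train = X[:40], y[:40]
X_val, y_val = X[40:], y[40:]

val_costs = {}
for lam in lambda_grid():
    w = fit_regularized(X_train, y_train, lam)
    val_costs[lam] = unregularized_cost(X_val, y_val, w)

best_lam = min(val_costs, key=val_costs.get)
for lam, c in val_costs.items():
    print(f"lambda={lam:<6g} validation cost={c:.4f}")
print("best lambda:", best_lam)
```

The validation cost is computed without the regularization term, so that models trained with different lambda values are compared on the same scale.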

Today I watched the video “Cost function with regularization” and how regularization actually works.

I don’t get it: what is the relation of increasing the regularization parameter λ to a very high number and decreasing the parameters w1,w2,…wn?

I know that I can avoid overfitting if I decrease the parameters w1,w2,…wn to get a less wiggly model.

If I increase the regularization parameter λ to a very high number, do I have to decrease the parameters w1,w2,…wn so that the regularization term is 0 or small, because we don’t want high costs?

Do I set w1,w2,…wn to a tiny number?

Do I set the regularization parameter λ to a high number?

Do I set the regularization parameter λ or do I set the parameters w1,w2,…wn first?

I don’t know what causes what?

Hello Daniel @Daniel_Blossey,

If we want to deal with overfitting:

Yes.

We set λ, which is a hyper-parameter. The parameters w1, w2, … are tuned and optimized by an optimizer like gradient descent, and not by us.

Setting a large λ gives the optimizer more incentive to push w1, w2, … down.

Cheers,

Raymond

Dear Raymond @rmwkwok,

thank you very much for your answers!

Is it true that I can set λ to a large number which results in a higher regularization term and ultimately in higher costs if the algorithm does not decrease the parameters w1,w2…wn?

So, the algorithm or optimizer such as gradient descent can be motivated to always find the lowest possible costs by decreasing the parameters w1,w2…wn if I set λ to a large number?

Thanks again and greetings to Hong Kong!

Hello Daniel,

The optimizer does one thing: tune the weights (w1, w2…) in the hope that the cost will go down. It does the same thing no matter what λ is.

The algorithm **will** try to decrease the weights, but I cannot guarantee whether the resulting cost will be higher or lower compared to another training run that starts with a smaller λ.

Always, no matter what λ is.

Usually the process is like this: we start with a set of randomly initialized weights, then we have a choice to make, namely what the value of λ is. Then the optimizer tunes the weights to go down the hill, trying to find a place where the cost is smaller.

A larger λ essentially changes the landscape of the cost space: everywhere is elevated, and the mountains are higher where w is large. The optimizer has no other option but to move towards smaller w to find a valley it can get to.
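Here is a toy one-weight version of that landscape, as a sketch only. It assumes a made-up data term 0.5·(w − 2)², so that without regularization the valley sits at w = 2, and adds the penalty 0.5·λ·w². The function name `cost` and the value `target=2.0` are illustrative choices.

```python
import numpy as np

def cost(w, lam, target=2.0):
    data_term = 0.5 * (w - target) ** 2   # lowest at w = target
    reg_term = 0.5 * lam * w ** 2         # lowest at w = 0, grows with |w|
    return data_term + reg_term

# Evaluate the landscape on a grid of w values and locate the valley.
w_grid = np.linspace(-1.0, 3.0, 4001)
minimizers = {}
for lam in [0.0, 1.0, 10.0, 100.0]:
    costs = cost(w_grid, lam)
    minimizers[lam] = w_grid[np.argmin(costs)]
    print(f"lambda={lam:<6g} valley at w ~ {minimizers[lam]:.3f}, "
          f"cost at w=2: {cost(2.0, lam):.1f}")
```

The cost at w = 2 climbs as λ grows, while the lowest point of the curve slides towards w = 0, which is where the optimizer ends up.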

I have been to Munich twice, and I like the Alps.

Raymond