Doubt about lambda

The value of lambda is chosen to be the same for all the features. Wouldn’t it be better if different lambda values were chosen for different features?

Hey @Vaibhav_Sharma4,
Welcome to the community. I have never really given it much thought myself, but now that you ask, let me try to convince you why that might not be a good choice. Let’s say that we have 100 weight parameters in our model (let’s forget the bias for now, just for simplicity). Now, if we choose different \lambda values for different weights, then instead of choosing a single \lambda, we will have to choose 100 different \lambda(s). As you might already be aware, we treat \lambda as a hyper-parameter of the model, and tuning 1 hyper-parameter is much easier than tuning 100 different ones.
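
To put a rough (purely hypothetical) number on that tuning cost: if we try, say, 5 candidate values per lambda, one shared lambda means 5 training runs, while 100 independent lambdas means 5^100 combinations, which is hopeless to search exhaustively.

```python
# Hypothetical grid-search sizes: 5 candidate values per lambda (made-up numbers)
candidates = 5
print(candidates ** 1)    # one shared lambda      -> 5 training runs
print(candidates ** 100)  # 100 per-weight lambdas -> 5**100 combinations
```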

However, I am not even sure why I am giving the above argument :joy:, since mathematically, choosing different \lambda values for different weights is the same as using a single \lambda value for all the weights. Let’s say that we have 3 different weights w_1, w_2, w_3.

Case-1

  • Same lambda value \lambda
  • So, the regularization term in the cost function is \lambda(w_1^2 + w_2^2 + w_3^2).

Case-2

  • Different lambda values \lambda * \lambda_1, \lambda * \lambda_2, \lambda * \lambda_3
  • So, the regularization term in the cost function is \lambda(\lambda_1 * w_1^2 + \lambda_2 * w_2^2 + \lambda_3 * w_3^2).
  • But what’s stopping us from initializing the weights as \frac{w_1}{\sqrt{\lambda_1}}, \frac{w_2}{\sqrt{\lambda_2}} and \frac{w_3}{\sqrt{\lambda_3}} respectively?
  • So, now the regularization term in the cost function is \lambda(\lambda_1 * \frac{w_1^2}{\lambda_1} + \lambda_2 * \frac{w_2^2}{\lambda_2} + \lambda_3 * \frac{w_3^2}{\lambda_3} ) which is nothing but \lambda(w_1^2 + w_2^2 + w_3^2).
  • So, assuming that the weights will update accordingly, the 2 cases have no real difference, as the short numerical sketch below also illustrates.
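
Here is a small numerical sketch of the re-scaling argument above (plain NumPy, with made-up values for the weights and the per-weight factors): after dividing each weight by \sqrt{\lambda_i}, the per-weight regularization term collapses back to the shared-\lambda term.

```python
import numpy as np

# Made-up example values (purely illustrative)
lam = 0.1                                # shared lambda
lam_i = np.array([2.0, 5.0, 0.5])        # hypothetical per-weight factors lambda_1..3
w = np.array([0.3, -1.2, 0.8])           # hypothetical weights w_1..3

# Case 1: one shared lambda
reg_case1 = lam * np.sum(w ** 2)

# Case 2: per-weight lambdas, with each weight re-scaled by 1 / sqrt(lambda_i)
w_rescaled = w / np.sqrt(lam_i)
reg_case2 = lam * np.sum(lam_i * w_rescaled ** 2)

print(reg_case1, reg_case2)              # both print the same value
```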

If you have a counterexample, please do share it with us. I hope this helps.

Cheers,
Elemento

Hello @Vaibhav_Sharma4,

Suppose your features have very different scales, for example feature A ranges from -1000 to 1000 whereas feature B ranges from -1 to 1. Without normalization, sharing the same lambda can be problematic, because the lambda affects the step size of the features’ weights, and the step size should be adapted to the feature scales. This seems to motivate having one lambda per weight.

However, having one lambda for each weight can also be problematic, because you can end up with too many lambdas, and that easily becomes unmanageable. For example, even a single layer of 10 neurons that takes in 10 features gives you 10 * 10 = 100 weights, or 100 lambdas to tune. When you build a NN for real data, it’s not uncommon to have at least 100,000 weights.

Therefore, the best approach is to normalize your features so that they scale similarly; then even sharing one lambda won’t cause too much trouble.
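
Here is a minimal sketch of that idea (z-score normalization in NumPy; the feature ranges are the made-up ones from above): once both columns are standardized, their weights live on comparable scales, so a single shared lambda penalizes them evenly.

```python
import numpy as np

# Two made-up features on very different scales
feature_a = np.random.uniform(-1000, 1000, size=500)   # roughly -1000 to 1000
feature_b = np.random.uniform(-1, 1, size=500)          # roughly -1 to 1
X = np.column_stack([feature_a, feature_b])

# Z-score normalization: subtract the mean, divide by the standard deviation
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# After normalization both columns have mean ~0 and std ~1,
# so one shared lambda treats their weights comparably.
print(X_norm.mean(axis=0), X_norm.std(axis=0))
```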

It is not that keeping one lambda per weight is destined to be bad; it is just that tuning so many lambdas is often not feasible.

Raymond