The value of lambda is chosen the same for all the features. Wouldn’t it be better if a different lambda value were chosen for each feature?

Hey @Vaibhav_Sharma4,

Welcome to the community. I have never really given it much thought myself, but now that you ask, let me try to convince you why that might not be a good choice. Let’s say that we have 100 weight parameters in our model (*let’s forget the bias for now, just for simplicity*). Now, if we choose different \lambda values for different weights, then instead of choosing a single \lambda, we will have to choose 100 different \lambda(s). As you might already be aware, we treat \lambda as a hyper-parameter of the model, and tuning 1 hyper-parameter is much easier than tuning 100 of them.

However, I am not even sure why I am giving the above argument, since mathematically, choosing different \lambda values for different weights is the same as using a single \lambda value for all the weights. Let’s say that we have 3 different weights w_1, w_2, w_3.

## Case-1

- Same lambda value \lambda
- So, the regularization term in the cost function is \lambda(w_1^2 + w_2^2 + w_3^2).

## Case-2

- Different lambda values \lambda \lambda_1, \lambda \lambda_2, \lambda \lambda_3
- So, the regularization term in the cost function is \lambda(\lambda_1 w_1^2 + \lambda_2 w_2^2 + \lambda_3 w_3^2).
- But what’s stopping us from reparameterizing the weights as \frac{w_1}{\sqrt{\lambda_1}}, \frac{w_2}{\sqrt{\lambda_2}} and \frac{w_3}{\sqrt{\lambda_3}} respectively?
- So, now the regularization term in the cost function is \lambda(\lambda_1 \cdot \frac{w_1^2}{\lambda_1} + \lambda_2 \cdot \frac{w_2^2}{\lambda_2} + \lambda_3 \cdot \frac{w_3^2}{\lambda_3}), which is nothing but \lambda(w_1^2 + w_2^2 + w_3^2).
- So, assuming that the weights update accordingly, the two cases have no real difference.
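The equivalence above is easy to check numerically. Here is a minimal sketch (all values are made up for illustration) comparing the shared-lambda penalty on the original weights with the per-weight-lambda penalty on the rescaled weights:

```python
import numpy as np

lam = 0.5                          # shared lambda
lam_i = np.array([2.0, 4.0, 9.0])  # hypothetical per-weight multipliers lambda_1..3
w = np.array([1.5, -0.8, 0.3])     # hypothetical weights w_1..3

# Case 1: shared lambda applied to the original weights
case1 = lam * np.sum(w ** 2)

# Case 2: per-weight lambdas applied to the rescaled weights w_i / sqrt(lambda_i)
w_rescaled = w / np.sqrt(lam_i)
case2 = lam * np.sum(lam_i * w_rescaled ** 2)

print(np.isclose(case1, case2))  # True: the two penalties are identical
```

The lambda_i cancel against the 1/lambda_i introduced by the rescaling, which is exactly the algebra in the bullet points above.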

If you have any contradictory example, please do share it with us. I hope this helps.

Cheers,

Elemento

Hello @Vaibhav_Sharma4,

Suppose your features have very different scales; for example, feature A ranges from -1000 to 1000 whereas feature B ranges from -1 to 1. Without normalization, sharing the same lambda can be problematic, because the lambda affects the step size of the features’ weights, but the step size should be adapted to the feature scales. This seems to motivate having one lambda per weight.

However, having one lambda for each weight can also be problematic, because you can easily end up with too many lambdas, and that quickly becomes unmanageable. For example, even a single layer of 10 neurons that takes in 10 features gives you 10 * 10 = 100 weights, or 100 lambdas to tune. When you build a NN for real data, it’s not uncommon to have at least 100,000 weights.

Therefore, the best approach is to normalize your features so that they scale similarly; then even sharing one lambda won’t cause much trouble.
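The normalization step can be sketched like this (a minimal example with made-up data; in practice you would use your library’s scaler, e.g. scikit-learn’s `StandardScaler`):

```python
import numpy as np

# Two made-up features on very different scales (roughly -1000..1000 vs -1..1)
X = np.array([[  900.0,  0.5],
              [ -400.0, -0.9],
              [  250.0,  0.1],
              [-1000.0,  0.8]])

# Standardize each feature to zero mean and unit variance
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_norm = (X - mu) / sigma

# After normalization, both columns have comparable scales,
# so a single shared lambda penalizes their weights fairly.
print(X_norm.mean(axis=0))  # ~ [0., 0.]
print(X_norm.std(axis=0))   # ~ [1., 1.]
```

Remember to save `mu` and `sigma` from the training set and reuse them when transforming new data.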

It is not that keeping one lambda per weight is destined to be bad; it is just that tuning that many lambdas is often not feasible.

Raymond