I just watched the video on regularization, and I noticed that the same lambda is applied to every layer. I am wondering if anyone does hyperparameter tuning with different lambdas for different layers, or sets some kind of standard (e.g. inversely related to Frobenius norm size) to weigh them differently?
It seems reasonable to me because without different weights, the regularization term would be dominated by the larger layers. E.g. if one layer maps 100 units to 20 and another maps 20 to 4, the former contributes 2000 weights versus 80, i.e. 25 times as many elements to the regularization total. Maybe there is a mathematical reason why that is actually good, but it is not obvious to me.
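For concreteness, here is a minimal Keras-style sketch of what I have in mind; the lambda values are made up, and I have simply scaled the larger layer's lambda down by the 25x size ratio:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

base_lambda = 0.01  # made-up value, would still need tuning

model = tf.keras.Sequential([
    layers.Input(shape=(100,)),
    # 100 -> 20 layer: 2000 weights, so its lambda is scaled down by 25x
    layers.Dense(20, activation="relu",
                 kernel_regularizer=regularizers.l2(base_lambda / 25)),
    # 20 -> 4 layer: 80 weights, keeps the base lambda
    layers.Dense(4, activation="softmax",
                 kernel_regularizer=regularizers.l2(base_lambda)),
])
```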
Hi @zer2 and welcome to Discourse. I guess that technically you could assign different lambdas to the different layers in your model. However, for the purpose of regularizing the weights and preventing overfitting, one value of lambda that controls the balance of how much you (over)fit your data usually suffices. Note also that more lambdas would mean more hyperparameters to tune, without a clear benefit in terms of the desired result of preventing overfitting.
By the way, regularization is common practice in other types of optimization problems as well, and there too, regardless of how many coefficients are optimized, the model typically uses a single regularization term for all of them.
Thank you for the response, and the welcome!
Cross-validating another L-1 hyperparameters (an extra lambda per layer) does sound impractical for any network of decent size.
If you wouldn’t mind, what are the other optimization algorithms that only use one lambda in an equivalent way? Certainly only one lambda is used for regularized linear regression, but practitioners scale the input features so that each contributes equally (as in the quick sketch below). I guess in theory an algorithm like xgboost could have a lambda that changes with the depth of the tree, or over the number of estimators. However, that would have each split/tree solving a different problem, which seems inelegant. I don’t know of other cases like neural networks, where the model has distinct ‘layers’ that could have different lambdas.
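A minimal scikit-learn sketch of that scaling point, with toy data and a made-up alpha:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, purely for illustration.
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

# Standardizing first means the single alpha (lambda) penalizes
# every coefficient on a comparable scale.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge.fit(X, y)
```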
There are other domains where optimization with regularization is used. For example, sparse inversion in the field of seismic exploration. The idea is to build a mathematical model of the data and promote (in a Bayesian sense) sparsity, which reflects the physics of wave propagation. The optimization problem is formulated as a minimization of the L2 residual between model and data. To that you add a regularizer of exactly the same form as the one added in deep learning models - lambda times the L2 norm of the model coefficients (or L1 or L0 - each yields a different model with a somewhat different solver).
In any case, for this type of model only one lambda is used, even though the number of model parameters can amount to tens of thousands or more.
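To make the analogy concrete, here is a toy sketch of that kind of problem (the sizes and lambda are made up, and scikit-learn's Lasso stands in for a dedicated seismic solver):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Toy "inversion": recover a sparse coefficient vector m from
# measurements d = G @ m + noise, with an L1 regularizer on m.
G = rng.normal(size=(100, 300))   # forward operator (underdetermined)
m_true = np.zeros(300)
m_true[rng.choice(300, size=5, replace=False)] = rng.normal(size=5)
d = G @ m_true + 0.01 * rng.normal(size=100)

# A single lambda (alpha) for all 300 coefficients, just as in the
# deep learning case.
lasso = Lasso(alpha=0.05)
lasso.fit(G, d)
m_est = lasso.coef_
```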
Very cool!
I have never even heard of sparse inversion, but it sounds interesting. Might be a new rabbit hole for me haha
The bigger area here is actually compressed sensing. In many problems in signal processing you would like to retrieve a continuous representation of a signal based on a few (potentially irregularly sampled) measurements. With the knowledge that the signal can be represented sparsely in some model domain, you employ a regularized minimization problem with an L(0, 1, 2) regularizer. You can draw many parallels between this area and models in deep learning. Read more if you’re interested: Compressed sensing - Wikipedia
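In symbols, the penalized form I'm referring to is roughly this (my notation, with p picking the L0, L1, or L2 regularizer):

$$\min_{x}\;\|y - Ax\|_2^2 + \lambda\,\|x\|_p,\qquad p \in \{0, 1, 2\}$$

where $y$ holds the few measurements, $A$ combines the sampling operator with the sparsifying transform, and $x$ are the (hopefully sparse) coefficients.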
Thanks for the link!