Course 2, week 1 L1 > L2 norm increasing sparseness, why?

Doug_Cutrell · November 25, 2022, 5:23pm

Andrew Ng states that for logistic regression, L1 regularization is sometimes thought to produce more sparsity than L2 regularization, with some of the weights becoming exactly zero. But I can’t think of any justification for that. It seems that either one would be equally likely to drive some weights to zero.

Is there a way to justify the idea that L1 produces more sparsity than L2?

paulinpaloalto · November 25, 2022, 5:46pm

I’m sure you can google up a more complete explanation, but here are some thoughts:

L1 regularization uses the sum of the absolute values of the weights, whereas L2 uses the sum of the squares of the weights, of course. These are added as “penalty terms” to the cost function to try to decrease the general magnitude of the weights, in the hope that this will reduce overfitting. So the L1 penalty grows linearly with the magnitude of a weight, whereas the quadratic penalty behaves differently: it punishes small values more lightly, but punishes large values much more severely. E.g. if a w value is 0.1, then the L1 penalty is 0.1, but the L2 penalty is 0.01. And if w is 0.01, then L1 is 0.01 and L2 is 0.0001. So in the L2 case, the closer the weight is to zero, the less the regularization “forcing function” works to further reduce it, whereas in the L1 case the push toward zero stays the same size however small the weight gets.
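To make the arithmetic above concrete, here is a minimal sketch in plain Python. The function names are only illustrative, and the regularization strength is left out, so each value is the raw per-weight penalty.

```python
def l1_penalty(w):
    """Per-weight L1 penalty: the absolute value of the weight."""
    return abs(w)


def l2_penalty(w):
    """Per-weight L2 penalty: the square of the weight."""
    return w * w


# The two small weights from the post, plus 1.0 for comparison.
weights = [1.0, 0.1, 0.01]
table = [(w, l1_penalty(w), l2_penalty(w)) for w in weights]

for w, p1, p2 in table:
    print(f"w={w:<5} L1={p1:<7.4f} L2={p2:.4f}")
```

For weights below 1 in magnitude the L2 column shrinks much faster than the L1 column, which is the point being made.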

rmwkwok · November 26, 2022, 1:32am

Just to supplement Paul’s answer, here is a graph of an L1 and an L2 regularization term. Close to zero, the gradient (slope) of the L2 term can be much smaller than that of the L1 term.
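The slopes can also be written out for a single weight. This is only a sketch: λ is an assumed symbol for the regularization strength, and the course's exact scaling constants are omitted.

```latex
\frac{d}{dw}\,\lambda\,|w| = \lambda\,\operatorname{sign}(w) \quad (w \neq 0),
\qquad
\frac{d}{dw}\,\lambda\,w^{2} = 2\lambda\,w
```

At w = 0.01, for example, the L1 slope is still λ, while the L2 slope is only 0.02λ.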

Source

Raymond

Doug_Cutrell · November 26, 2022, 3:04am

Thanks for the illuminating comments. I also found something in the recent 3rd edition of Hands-On Machine Learning, which discusses exactly this under “Lasso Regression” (Least Absolute Shrinkage and Selection Operator Regression) on pp. 159 and 160, with some figures. The answer is fairly obvious in retrospect, and maybe just what you were telling me: the L1 regularization term reduces the magnitude of each weight by the same constant amount at each step, so the smallest weights will hit zero early on, and then be kept there.
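That behaviour can be illustrated with a toy simulation. This is a sketch under loud assumptions: only the regularization term is applied (no data loss at all), the starting weights, learning rate `lr` and strength `lam` are made up, and the L1 step is clipped at zero (a soft-threshold update) so that a weight does not overshoot and oscillate around zero. It is not the course's or the book's implementation.

```python
def l1_step(w, lr, lam):
    """Shrink |w| by the constant lr * lam, clipping at zero (assumed soft-threshold)."""
    shrink = lr * lam
    magnitude = max(abs(w) - shrink, 0.0)
    return magnitude if w > 0 else -magnitude


def l2_step(w, lr, lam):
    """Gradient step on lam * w**2: multiplies w by a constant factor below 1."""
    return w * (1.0 - 2.0 * lr * lam)


def shrink_weights(weights, step_fn, n_steps, lr=0.1, lam=0.1):
    """Apply the regularization-only update n_steps times to every weight."""
    ws = list(weights)
    for _ in range(n_steps):
        ws = [step_fn(w, lr, lam) for w in ws]
    return ws


initial = [1.0, 0.5, 0.05, -0.02]  # hypothetical starting weights
l1_result = shrink_weights(initial, l1_step, n_steps=10)
l2_result = shrink_weights(initial, l2_step, n_steps=10)

print("L1:", l1_result)
print("L2:", l2_result)
```

With these made-up numbers the two smallest weights reach exactly zero under the L1 update and stay there, while the L2 update only scales every weight down by the same factor and leaves all of them nonzero.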
