Lecture “Why regularization reduces overfitting?”, time 1:23

Why does cranking \lambda up to a large value set w = 0?

Thank you.

It doesn’t set any of the W_{ij} values to exactly zero. It just causes the optimization to push the values to be smaller. The regularization term is \frac{\lambda}{2m}\|w\|^2, so if you want that term to be small and \lambda is large, that forces the norm of w to be small. This is not a deep or subtle point: what you are minimizing is the sum of the usual “log loss” cost plus the L2 regularization term.
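Written out explicitly (using standard logistic-regression notation; the exact symbols may differ slightly from the lecture slides), the objective being minimized is:

\displaystyle J(w, b) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log \hat{y}^{(i)} + (1 - y^{(i)})\log(1 - \hat{y}^{(i)})\right] + \frac{\lambda}{2m}\|w\|_2^2

The first term is the “log loss” and the second is the L2 penalty, so a larger \lambda gives the penalty more weight relative to the data-fitting term.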

Of course, if you were only seeking the minimum of \displaystyle \frac{\lambda}{2m}\|w\|^2 by itself, there is an obvious solution: w = 0. But the point is that would (one hopes) not give a very good solution for your actual model, meaning that the “log loss” term would be large. So what you are seeking is a value of \lambda that gives a good balance between the log loss term and the regularization term. If you use some huge value of \lambda, you might very well end up with most elements of w pushed very close to zero and thus a bad solution. It wouldn’t overfit, but it wouldn’t be useful for much either.
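You can see this trade-off directly with a small experiment. The sketch below (a minimal logistic regression trained by plain gradient descent on synthetic data; the data, learning rate, and step count are all made up for illustration) shows that \|w\| shrinks as \lambda grows, but never snaps to exactly zero:

```python
import numpy as np

def train_logreg(X, y, lam, lr=0.1, steps=2000):
    """Gradient descent on log loss + (lam / 2m) * ||w||^2."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
        grad_w = X.T @ (p - y) / m + (lam / m) * w  # penalty adds (lam/m) * w
        grad_b = np.sum(p - y) / m                  # bias is not regularized
        w -= lr * grad_w
        b -= lr * grad_b
    return w

# Synthetic, linearly separable-ish data (hypothetical example)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X @ np.array([2.0, -1.5, 1.0, 0.5, -0.5]) > 0).astype(float)

for lam in [0.0, 1.0, 100.0]:
    w = train_logreg(X, y, lam)
    print(f"lambda = {lam:6.1f}   ||w|| = {np.linalg.norm(w):.4f}")
```

The printed norms decrease monotonically as \lambda increases: the penalty shrinks the weights toward zero without ever setting them to exactly zero, which is exactly the behavior described above.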