Struggling to get intuition for the regularization term. How does adding the L2 regularization term decrease the weights of useless features, and why doesn’t it significantly impact (or decrease) the other features’ weights?

Thanks!

@om_pandey07 I wouldn’t say so much that the features regularization mostly affects (applies the highest penalty to) are ‘useless’.

Rather the goal is to avoid overfitting the model and increase generalization. You are trying to reduce noise and the effect of extreme outliers in the model.

I mean, in the simplest case, imagine you have a data set of 10 points. Eight of these points are highly linearly correlated (so imagine drawing a line through them)-- but two of these points are way off.

Obviously you don’t want them to have *too* much influence on your model, so those weights are decreased.

However, this happens not just to those features-- but to *all* the features in your model. If it did not, you would essentially be changing your dataset (and now answering a different problem).

The key thing is that, as these other points are more in line with the rest of the data, their weight decrease is much smaller.
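A quick numpy sketch of that toy scenario (the numbers here are made up for illustration): fit a line through ten points, eight on a clean trend and two way off, once with plain least squares and once with an L2 penalty (ridge). The penalized solution has smaller weights overall.

```python
import numpy as np

# Hypothetical data: 8 points near the line y = 2x, plus 2 extreme outliers.
x = np.arange(1.0, 11.0)
y = 2.0 * x
y[-2:] += np.array([15.0, -20.0])  # the two "way off" points

# Design matrix with a bias column.
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: w = (X^T X)^{-1} X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# L2-regularized (ridge) solution: w = (X^T X + lam*I)^{-1} X^T y
lam = 5.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print("OLS weights:  ", w_ols)
print("Ridge weights:", w_ridge)
```

The regularized fit trades a little accuracy on the training points for weights that are less driven by the two outliers.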

I see… so how does adding that regularization term help us decrease the weights, and how does it “know” which features’ weights it has to reduce more?

I mean, if a feature is relevant, its weight will generally be high, so adding that term will increase the loss-- won’t the model decrease that weight too?

Well the ultimate effect can also be a bit different whether we are talking about L1 or L2 regularization.

Are you asking about in the case of L2 specifically?

Yes for L2

Well, a big part of this is that you have to recall the regularization term is tacked on to our cost function, and we are running through this with forward and back prop many times.

It ‘knows’ which terms are relevant because increasing their weights will decrease our loss.

However, since the regularization is (literally) a ‘+’ term added on, we end up with a trade-off: does the increase in a given weight provide a net gain-- or a net loss-- in cost, versus the size of that particular term’s addition to the regularization penalty (since L2 is a squared sum)?

Regularization is going to *increase* cost (loss) one way or another, but it also forces the model to adjust the weights in such a way that the overall benefit of increasing the weights on some nodes, while decreasing others, provides a net gain in the end.

If cost *increases* for that node, the weights are forced down; vice versa if it *decreases*.
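One way to see the per-weight trade-off concretely (a sketch with made-up numbers): the L2 term (λ/2)·Σw² contributes λ·w to each weight’s gradient, so in the update step the penalty’s downward pull on each weight is proportional to that weight’s own magnitude-- big weights get pushed down much harder than small ones.

```python
import numpy as np

# Update rule with L2:  w <- w - lr * (data_grad + lam * w)
# The penalty's pull on each weight is lr * lam * w, i.e. proportional
# to the weight's own magnitude.
lr, lam = 0.1, 0.5
w = np.array([4.0, 0.1])           # one large weight, one small one
data_grad = np.zeros_like(w)       # zero data gradient, to isolate the penalty

penalty_pull = lr * lam * w        # shrinkage applied to each weight
w_new = w - lr * (data_grad + lam * w)

print("pull on each weight:", penalty_pull)  # the large weight is pulled 40x harder
print("updated weights:    ", w_new)
```

So the “knowing” is automatic: when the data gradient (the signal that a weight is useful) outweighs λ·w, the weight survives; when it doesn’t, the weight shrinks.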

That clearly helps! Thank you so much

Yes, the effect of the L2 regularization term will be more pronounced on weights with larger magnitudes. If a given weight is already lower in absolute value, then it will not be affected as much. As Anthony mentioned, the intent is to prevent particular individual features from having too much influence on the results, on the theory that this situation represents overfitting.

Of course there is no guarantee that this (meaning L2 regularization) will work in every case, and it’s also very much a matter of correctly tuning the λ value. It’s a battle between the base cost function and the regularization term. If you make λ a relatively large value, then the L2 term dominates and in the limit you drive all the weights to zero and the model becomes useless. It’s been a while since I watched these lectures, but I’m pretty sure I remember Prof Ng making that point in the lectures.

To succeed here you need to experiment to determine just the “Goldilocks” amount of the L2 suppression of the weights to give a general solution.
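That λ-tuning battle is easy to demonstrate with a small sketch (synthetic data, closed-form ridge regression w = (XᵀX + λI)⁻¹Xᵀy): as λ grows, the fitted weights are driven toward zero.

```python
import numpy as np

# Synthetic regression problem with known true weights.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

# Closed-form ridge solution for increasingly large lambda.
norms = {}
for lam in [0.0, 1.0, 100.0, 1e6]:
    w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    norms[lam] = np.linalg.norm(w)
    print(f"lambda = {lam:>9g}  ->  ||w|| = {norms[lam]:.5f}")
```

At λ = 0 you recover plain least squares; by λ = 1e6 the weights are essentially zero and the model predicts nothing useful-- the “Goldilocks” value sits somewhere in between and has to be found by experiment (e.g. on a validation set).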


Thanks for the explanation :))