How does penalising all wj terms help in reducing only the unimportant wj parameters in regularization / gradient descent?


All features are eligible for reduction via regularization. The algorithm has no knowledge of a feature’s importance. It’s just a mathematical process.

But features that aren’t important tend to end up with very small weights anyway, so regularization doesn’t shrink them by much in absolute terms.
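To make that concrete, here is a minimal NumPy sketch (the weights, learning rate, lambda, and m below are made-up values for illustration, not from the course labs). The L2 penalty (lam / (2m)) · Σ wj² contributes (lam / m) · wj to the gradient of each weight, so the pull toward zero is proportional to the weight itself:

```python
import numpy as np

# Hypothetical weights: w[0] plays the role of an "important" (large) weight,
# w[1] an "unimportant" (small) one. The values are purely illustrative.
w = np.array([5.0, 0.01])

alpha = 0.1   # learning rate (assumed)
lam = 1.0     # regularization strength lambda (assumed)
m = 100       # number of training examples (assumed)

# Gradient of the L2 penalty (lam / (2*m)) * sum(w_j**2) w.r.t. each w_j
# is (lam / m) * w_j, i.e. proportional to the weight itself.
reg_grad = (lam / m) * w

# Apply only the regularization part of one gradient-descent step,
# ignoring the data-fit gradient, to isolate the shrinkage effect.
w_after = w - alpha * reg_grad

print("before:", w)        # [5.    0.01]
print("after: ", w_after)  # [4.995   0.00999]
```

The large weight drops by 0.005 in one step while the tiny one moves by only 0.00001, so the penalty mostly restrains the weights that are big enough to cause overfitting.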


@TMosh thank you for your reply!!

But I have follow-up questions:

  1. If unimportant features are going to have small weights anyway, how does regularisation help at all?

  2. If unimportant features are going to have small weights, shouldn’t that mean the model won’t overfit in any case?

You don’t know in advance which features are more or less important. So you have to apply regularization to all of them, and let the optimizer figure out the details.

The machine is doing the learning, so you don’t have to.
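For reference, here is a minimal sketch of what “apply regularization to all of them” looks like for linear regression with an L2 penalty. The function name and shapes are my own, and the bias b is left unregularized, as is conventional:

```python
import numpy as np

def regularized_cost(X, y, w, b, lam):
    """Mean-squared-error cost with an L2 penalty on every weight.

    The penalty (lam / (2m)) * sum(w_j**2) is applied uniformly, with no
    notion of feature importance; gradient descent decides how far each
    w_j can shrink while still fitting the data.
    """
    m = X.shape[0]
    predictions = X @ w + b          # linear model f(x) = w.x + b
    mse = np.sum((predictions - y) ** 2) / (2 * m)
    penalty = (lam / (2 * m)) * np.sum(w ** 2)
    return mse + penalty
```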