This is the equation for regularised regression: J(w,b) = (1/2m) Σᵢ (f(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + (λ/2m) Σⱼ wⱼ²

I don’t understand: why do we reduce all the wj instead of just the large ones?

{When we add (λ/2m) times the summation of wj², and then try to minimise the cost function, it means we reduce all the w’s by a greater amount than we would have when the data was overfit (before adding the λ term) — sorry, that is: because earlier w was involved in the model function only, but now it also appears outside it (outside the loss term, in this case).}
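To make that concrete, here is a small sketch (NumPy; the function names and data shapes are my own assumptions, not from the course) of the regularised cost and one gradient-descent step. The λ term adds (λ/m)·wj to every weight’s gradient, so each update shrinks every wj toward zero, not just the large ones:

```python
import numpy as np

def regularized_cost(X, y, w, b, lam):
    """Squared-error cost plus an L2 penalty on the weights (b is not penalized)."""
    m = X.shape[0]
    err = X @ w + b - y
    loss = (err @ err) / (2 * m)           # data-fit term
    penalty = (lam / (2 * m)) * (w @ w)    # (lambda/2m) * sum of wj^2
    return loss + penalty

def gradient_step(X, y, w, b, lam, alpha):
    """One gradient-descent step; the penalty adds (lam/m)*w to dw for every wj."""
    m = X.shape[0]
    err = X @ w + b - y
    dw = (X.T @ err) / m + (lam / m) * w   # every weight gets the extra pull
    db = err.mean()
    return w - alpha * dw, b - alpha * db
```

Note that the extra pull on each weight is proportional to the weight itself, which is why the larger weights do end up moving more in absolute terms.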

~ Kay

@Kavalanche in the end you do end up penalizing the larger weights more. But as to your question of why we don’t apply the penalty only to the large ones, rather than to the entire set:

Well, think of it this way: if you applied the shrinkage unevenly, you’d be pulling your overall set of weights closer together, but you’d also be *changing* what the model has learned from your data set. That is, even though the larger weights shrink more in absolute terms, they still carry vital information about your data set; if you *only* compressed those, they would no longer be in line with the information represented by the smaller weights. Thus you apply the penalty equally across the full set of weights, so everything moves together.
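A quick numeric sketch of that point (the weights and hyperparameters here are made up for illustration): in a gradient-descent update, the L2 term multiplies every weight by the same factor (1 − αλ/m), so larger weights shrink more in absolute terms but all the ratios between weights are preserved. A hypothetical "only shrink the big ones" rule distorts those ratios:

```python
import numpy as np

w = np.array([10.0, 1.0, 0.1])        # toy weights, ratios 100 : 10 : 1
alpha, lam, m = 0.1, 5.0, 10
shrink = 1 - alpha * lam / m           # same multiplicative factor for every wj

# What the L2 penalty does: uniform shrink, ratios unchanged.
w_uniform = w * shrink
print(w_uniform / w_uniform[-1])       # still 100 : 10 : 1

# Hypothetical selective rule: shrink only weights with |wj| > 1.
w_selective = np.where(np.abs(w) > 1.0, w * shrink, w)
print(w_selective / w_selective[-1])   # largest weight is now underweighted
```

The uniform version keeps the relative importance the model assigned to each feature; the selective version silently re-ranks them.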
