Why do we square the parameters in regularization?

Apart from simplifying the calculations, is there any other reason for squaring?
Why not raise it to a higher power (3, 4, etc.)?

Squaring has several nice properties:

  • The partial derivative is easily computed. This is important for efficient gradient descent code.
  • The partial derivative is defined for all values.
  • Both positive and negative errors are handled automatically.
  • Large errors have greater impact on the cost than small errors.
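To make the first two bullets concrete, here is a minimal sketch (my own illustration, with assumed values for the weights and the regularization strength): the L2 penalty has a gradient that is just a scaled copy of the weight vector, defined everywhere, and negative weights penalize the cost exactly like positive ones.

```python
import numpy as np

# Hypothetical weight vector and regularization strength (assumed values)
w = np.array([-2.0, 0.5, 3.0])
lam = 0.1

# L2 penalty: lambda * sum(w^2)
penalty = lam * np.sum(w ** 2)

# Its gradient is simply 2 * lambda * w: linear in w and defined for all values,
# which keeps gradient descent updates cheap and well-behaved
grad = 2 * lam * w

# Sign handling is automatic: w = -2 and w = +2 contribute the same penalty
```

Note how the gradient never requires a special case for negative weights, which is part of why squaring is so convenient in gradient descent code.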

I agree with all the points, but won't the same be true for x^4?


Hello @vaibhavoutat,

We can further ask about w^8, w^{32}, and w^{100}, but I would choose the smallest order, w^2, because (1) it is sufficient and (2) I don't want to run into numbers that are extremely large or extremely small. For example, under w^{100}, w = 3 becomes astronomically large and w = 0.003 becomes vanishingly small, and both extremes are problematic for floating-point arithmetic.
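A quick numerical check of that point (my own illustration, using the example values w = 3 and w = 0.003): the squares stay in a comfortable range, while the 100th powers rush toward the edges of what 64-bit floats can represent.

```python
# Comparing w^2 with w^100 for the two example weights
w_large, w_small = 3.0, 0.003

sq_large = w_large ** 2    # 9.0: easy to handle
sq_small = w_small ** 2    # 9e-06: still comfortably representable

hi = w_large ** 100        # ~5.15e47: astronomically large
lo = w_small ** 100        # ~5e-253: vanishingly small, near float64 limits

# Any further squaring of `lo` would underflow to 0.0 entirely,
# destroying the gradient signal for that weight
```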

Having said that, these are my reasons for w^2, but you can have your reasons for w^4.



Are you discussing creation of additional features, or computing the cost?
They are rather different topics.

I am talking about computing the cost only.

Once you compute the feature values, they’re simple floats.

So there won’t be any “fourth power” that you need to use in computing the cost. It’s still the sum of the squares of the errors.
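To illustrate that separation (a sketch of my own, with made-up data and weights): even if an engineered feature such as x^2 is used, it is computed first and stored as a plain float, and the cost afterwards is still just the mean of the squared errors.

```python
import numpy as np

# Hypothetical training data: raw feature x plus an engineered feature x^2
x = np.array([1.0, 2.0, 3.0])
X = np.column_stack([x, x ** 2])   # after this step, features are just floats
y = np.array([2.0, 5.0, 10.0])

# Assumed model parameters for illustration
w = np.array([1.0, 1.0])
b = 0.0

preds = X @ w + b

# The cost is still the sum (here, mean) of squared errors;
# no fourth power appears, no matter how the features were created
cost = np.mean((preds - y) ** 2)
```

The key design point: feature engineering changes the inputs to the model, while the squared-error form of the cost function stays the same.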