If regularization is used to reduce the effect of selected parameters, why do we add, instead of subtract, lambda / m times the sum of those parameters in the cost function?

Remember that our goal is to **minimize the cost function**. Adding the lambda-scaled term acts as a penalty on w: whenever a weight w_i becomes large, the cost increases, which pushes the optimizer away from solutions where some coefficients are extremely large and so makes the model less prone to overfitting. If we subtracted the term instead, large weights would *lower* the cost and be rewarded, which is the opposite of what we want.
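Here is a minimal sketch of the idea, assuming a linear-regression mean-squared-error cost with the usual L2 penalty `(lambda / (2m)) * sum(w**2)` (the function name and exact scaling are illustrative, not from the original post):

```python
import numpy as np

def regularized_cost(X, y, w, b, lambda_):
    """Sketch of a regularized cost, assuming linear regression with L2 penalty."""
    m = X.shape[0]
    preds = X @ w + b
    # Ordinary mean-squared-error term
    mse = np.sum((preds - y) ** 2) / (2 * m)
    # The penalty is ADDED, so larger weights raise the cost
    penalty = (lambda_ / (2 * m)) * np.sum(w ** 2)
    return mse + penalty
```

With two weight vectors that fit the data equally well, the one with larger weights yields a higher cost, so minimization favors the smaller weights.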


The gradients come from computing the partial derivative of the cost function with respect to each parameter.

Since the cost function includes the regularization penalty, the penalty's partial derivative is included in the gradients as well.
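As a sketch of that point, assuming the same linear-regression cost with penalty `(lambda / (2m)) * sum(w**2)` (names are illustrative): differentiating the penalty with respect to `w_j` contributes `(lambda / m) * w_j` to the gradient, while `b` is not regularized.

```python
import numpy as np

def regularized_gradient(X, y, w, b, lambda_):
    """Gradients of an L2-regularized linear-regression cost (sketch)."""
    m = X.shape[0]
    err = X @ w + b - y
    # Derivative of the penalty (lambda/(2m)) * sum(w**2) is (lambda/m) * w
    dj_dw = (X.T @ err) / m + (lambda_ / m) * w
    # b has no penalty term, so its gradient is unchanged
    dj_db = np.sum(err) / m
    return dj_dw, dj_db
```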
