Why is the regularization term in the cost function w_j^2 instead of just w_j?

Link to video HERE at about 4:00.

There was no clear explanation of where the w_j^2 regularization term came from in the cost function. Why is it w_j^2 instead of just w_j?

Just as we compute the cost from the squares of the errors (so that every error contributes a positive value, and larger errors have a bigger impact), squaring the weights follows the same reasoning.

The goal is to have an additional penalty for having large weight values. This helps the algorithm learn smaller weights, thereby reducing overfitting.
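Here is a minimal sketch of that effect on a one-weight linear regression (the data, `grad_step` helper, and the specific `alpha`/`lambda_` values are made up for illustration): the extra `(lambda_ / m) * w` term in the gradient pulls the learned weight toward zero, so the regularized run ends up with a smaller weight than the plain run.

```python
# Hypothetical one-feature example: y = 2x exactly, so the
# unregularized weight converges to 2.0.

def grad_step(w, x, y, alpha=0.1, lambda_=0.0):
    """One gradient-descent step on (1/2m)*sum((w*x - y)^2) + (lambda/2m)*w^2."""
    m = len(x)
    grad = sum((w * xi - yi) * xi for xi, yi in zip(x, y)) / m
    grad += (lambda_ / m) * w   # extra term from the w^2 penalty
    return w - alpha * grad

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

w_plain, w_reg = 0.0, 0.0
for _ in range(200):
    w_plain = grad_step(w_plain, x, y, lambda_=0.0)
    w_reg = grad_step(w_reg, x, y, lambda_=3.0)

# The regularized weight is shrunk below the unregularized one.
print(w_plain, w_reg)
```

With real, noisy data this shrinkage is what tames large weights and reduces overfitting; here the data are noiseless, so it only demonstrates the pull toward zero.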

The “sum of the squares” method has some nice computational properties when we compute the gradients (i.e. partial derivatives).
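Concretely, for the L2-regularized linear regression cost (the form used in this course), the squared term differentiates cleanly, adding just a simple linear term to each weight's gradient:

```latex
J(\vec{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2

\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} w_j
```

By contrast, a plain $w_j$ penalty would reward driving weights to large negative values, and an absolute-value penalty $|w_j|$ (L1 regularization) is not differentiable at $w_j = 0$, so it can't be handled by ordinary gradient descent alone.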


Thanks for the explanation! As I understand it, using a squared weight is more a choice made for its useful properties than a term derived mathematically? I’m still wrapping my head around how much of the cost function is mathematically derived versus chosen because it produces good results.


Cost functions are designed to suit a specific goal. The linear regression cost is based on the method of least squares. The logistic regression cost is based on heavily penalizing confident wrong binary predictions (classifications).
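To make that difference concrete, here is a small sketch of both per-example costs (function names are my own, not from the course): the squared-error cost penalizes distance from the target, while the logistic (cross-entropy) cost grows without bound as a confident prediction turns out to be wrong.

```python
import math

def squared_error(pred, target):
    # least-squares cost: quadratic penalty on the distance from the target
    return 0.5 * (pred - target) ** 2

def logistic_cost(prob, label):
    # cross-entropy cost for a binary label in {0, 1}:
    # -log(prob) if label is 1, -log(1 - prob) if label is 0
    return -label * math.log(prob) - (1 - label) * math.log(1 - prob)

# A confident wrong classification (prob=0.99 for a true label of 0)
# costs far more than a mildly wrong one (prob=0.6).
print(logistic_cost(0.99, 0))
print(logistic_cost(0.6, 0))
```

This is why the logistic cost is the natural fit for classification: it pushes the model away from confident mistakes much harder than a squared-error cost would.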
