Hello,
I’m having a bit of trouble understanding how regularization works intuitively. The concept of overfitting makes lots of sense - from my understanding your model conforms to the training data too well at the cost of predicting new data. One of the problems that leads to this is too many degrees of freedom if we have lots of features.

To take care of this we try and minimize the parameters - but why are large parameters a sign of overfitting? Also how do we define large parameters? If the units are in millimeters the parameters will obviously be larger numbers than if they were in kilometers. How do we account for this?

Lastly, how does the minimization of the cost function know which parameters to greatly attenuate and which ones to not attenuate as much? Intuitively, how does the math behind this work out? I understand the regularization term in the cost function penalizes bigger parameters, but how does the minimization process know to attenuate the w_j in front of an x^6 term much more than the w_j in front of an x^2 term, for example? I get that the balancing act with the MSE term in the cost function plays into it, but I’m just having a hard time understanding how it all balances out.

The idea is that by reducing the magnitude of the weights, we accept a slightly worse fit to the training data in exchange for a model that generalizes better to new data. The regularization term is simple to implement and has a very easily computed partial derivative, which makes the gradients easy to compute.

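Minimization has no idea which features to attenuate. It isn’t that smart. On each step the penalty just shrinks all of the weights by the same ratio (a ratio set by the lambda parameter). It is the error term in the cost that pushes back, and it pushes back hardest on the weights the fit actually depends on.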
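To make the "same ratio" point concrete, here is a minimal sketch of a single gradient descent step. It assumes a linear model with no bias term and the cost J = (1/2m) * sum of squared errors + (lambda/2m) * sum of w_j^2. The function name l2_gradient_step and the toy data are made up for illustration; they are not from the thread.

```python
import numpy as np

def l2_gradient_step(w, X, y, alpha, lam):
    """One gradient descent step on MSE/2 + (lam / 2m) * ||w||^2 (assumed cost)."""
    m = X.shape[0]
    residual = X @ w - y
    grad_mse = X.T @ residual / m      # gradient of the data-fit term only
    shrink = 1.0 - alpha * lam / m     # the same factor for every weight
    # The penalty shrinks all weights uniformly; the MSE gradient then
    # moves each weight by a different, data-dependent amount.
    return shrink * w - alpha * grad_mse, shrink

# Hypothetical toy data: 20 examples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.1, size=20)

w = np.array([1.0, 1.0, 1.0])
alpha, lam = 0.1, 1.0
w_new, shrink = l2_gradient_step(w, X, y, alpha, lam)
print("shrink factor:", shrink)
print("weights after one step:", w_new)
```

The shrink factor does not depend on j, so the penalty alone never singles out a feature. Any difference in how much each weight ends up attenuated comes from the grad_mse term.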

I will break your question down into parts and try to answer each of them.

Why are large parameters a sign of overfitting?

Large parameters aren’t necessarily a sign of overfitting, but they can be. Overfitting usually occurs when the model is too complex, capturing noise in the training data instead of the underlying pattern. Large parameters can be an indication of this, as they may cause the model to fit very closely to the training data, even at the expense of generalizing well to new data. Regularization helps to counteract this by discouraging the model from using overly large parameters.

How do we define large parameters, and how do we account for different units of measurement?

You’re right that the scale of the features matters when defining what is considered a large parameter. One way to account for this is to normalize or standardize your features before training your model. This puts all features on a comparable scale, so the size of a parameter no longer depends on the units you happened to measure in, and the magnitudes of the parameters become easier to compare.
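As a rough illustration of the units issue, the sketch below fits the same made-up data twice, once with the feature in kilometers and once in millimeters, and then again after standardizing. The data and the helper names fit_slope and standardize are assumptions for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)
length_km = rng.uniform(1.0, 10.0, size=50)
y = 3.0 * length_km + rng.normal(0.0, 0.1, size=50)
length_mm = length_km * 1e6  # same quantity, different unit

def fit_slope(x, y):
    # Ordinary least squares with an intercept; returns only the slope.
    A = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0]

def standardize(x):
    # Zero mean, unit standard deviation.
    return (x - x.mean()) / x.std()

w_km = fit_slope(length_km, y)
w_mm = fit_slope(length_mm, y)
w_km_std = fit_slope(standardize(length_km), y)
w_mm_std = fit_slope(standardize(length_mm), y)

print("raw weights:         ", w_km, w_mm)
print("standardized weights:", w_km_std, w_mm_std)
```

The raw weights differ by a factor of a million even though the model is the same, so a penalty on weight size would treat them very differently. After standardizing, both versions give the same weight, which is why scaling is usually done before regularizing.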

How does the minimization of the cost function know which parameters to attenuate?

The addition of the regularization term to the cost function influences the minimization process. Let’s take a look at L1 and L2 regularization, two common types:

L1 (Lasso) regularization: This adds the sum of the absolute values of the parameters (multiplied by a regularization constant) to the cost function. During optimization, this encourages some parameters to shrink all the way to zero, effectively removing them from the model. This can help select the most important features.

L2 (Ridge) regularization: This adds the sum of the squared values of the parameters (multiplied by a regularization constant) to the cost function. During optimization, this encourages the parameters to be smaller in magnitude. This is less likely to completely remove features but rather balances their contributions.
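The difference between the two penalties is easiest to see in a deliberately simplified setting: a single weight whose unregularized best value is a, with the penalized objectives (1/2)(w - a)^2 + lam*|w| for L1 and (1/2)(w - a)^2 + (lam/2)*w^2 for L2. Both have closed-form minimizers. The function names below are just labels for this sketch, and real multi-feature problems are not this tidy.

```python
import numpy as np

def l1_solution(a, lam):
    # Minimizer of 0.5*(w - a)**2 + lam*|w| (soft thresholding):
    # subtract lam from the magnitude, clip at zero.
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def l2_solution(a, lam):
    # Minimizer of 0.5*(w - a)**2 + 0.5*lam*w**2:
    # scale by a constant factor, never exactly zero unless a is zero.
    return a / (1.0 + lam)

a = np.array([3.0, 0.5, -0.2])  # hypothetical unregularized weights
lam = 1.0

w_l1 = l1_solution(a, lam)
w_l2 = l2_solution(a, lam)
print("L1:", w_l1)
print("L2:", w_l2)
```

With L1, any weight whose unregularized value is smaller in magnitude than lam is set to zero, while L2 scales every weight by the same factor and leaves all of them nonzero.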

In both cases, the regularization term works alongside the mean squared error (MSE) or other error terms in the cost function. During optimization, the model seeks to minimize the cost function, balancing the need to fit the data well (minimizing the MSE) and keeping the parameters small (minimizing the regularization term). This balance is influenced by the regularization constant, which determines the trade-off between fitting the data and regularizing the parameters.
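The balancing act can be sketched with a small ridge regression on polynomial features x^1 through x^6, in the spirit of the x^6 versus x^2 example from the question. Everything here is assumed for illustration: the data is synthetic (a noisy quadratic), the features are standardized, the target is centered so no bias term is needed, and ridge_fit uses the closed-form solution rather than gradient descent.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=30)
y = x**2 + rng.normal(scale=0.1, size=30)  # the true pattern is quadratic

# Polynomial features x^1 ... x^6, standardized column by column.
X = np.column_stack([x**p for p in range(1, 7)])
X = (X - X.mean(axis=0)) / X.std(axis=0)
y_c = y - y.mean()  # centered target, so no intercept is needed

def ridge_fit(X, y, lam):
    # Closed-form minimizer of ||X w - y||^2 + lam * ||w||^2.
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

lambdas = [0.0, 0.1, 10.0]
norms, mses = [], []
for lam in lambdas:
    w = ridge_fit(X, y_c, lam)
    norms.append(np.linalg.norm(w))
    mses.append(np.mean((X @ w - y_c) ** 2))
    print(f"lambda={lam:5.1f}  |w|={norms[-1]:.3f}  train MSE={mses[-1]:.4f}")
    print("   weights for x^1..x^6:", np.round(w, 3))
```

As lambda grows, the overall size of the weight vector goes down and the training error goes up, which is the trade-off described above. The penalty itself treats every weight the same; which individual weights survive depends on how much the error term would suffer if they were shrunk.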

In summary, regularization discourages the model from using overly large parameters by adding a penalty term to the cost function. The optimization process then tries to minimize this cost function, which results in a balance between fitting the data well and keeping the parameters small. Normalizing or standardizing your features helps ensure that the regularization term operates on a consistent scale, and the choice of regularization type (L1 or L2) can influence which parameters are more greatly attenuated.