If the goal is for the cost to be as close to zero as possible, why do we always add the regularization term to the calculated loss? Why don’t we first check whether the calculated loss is negative and then, based on that, add or subtract the regularization term to bring the cost close to zero?
Wouldn’t always adding the regularization term cause the cost to increase if it is already greater than zero?
How exactly does this penalise the weights for a specific feature?
If our model overfits the data, what would be the loss value? Close to zero, right? That is why we add a regularization term to balance the learning. Subtracting it would reduce the loss further, leading to overfitting.
@themightywolfie I wish I had a better way to explain it (note: ‘better way’ only applies to what I was trying to answer below, not @saifkhanengr’s reply), but it is early here and I haven’t had my coffee yet.
Where you ask ‘how does this penalize a specific feature?’: I think the better way to think of it is that all the weights get penalized, but the ones that are much larger in magnitude (kind of like your outliers) get penalized much more, because the penalty grows with the square of each weight. That is what produces the ‘regularization’ effect.
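Here is a tiny numeric sketch of that point, in case it helps. It assumes the L2 penalty discussed in this thread (a constant times the sum of the squared weights); the weight values and the constant `lam` are made up for illustration, and the exact scaling used in the course may differ.

```python
import numpy as np

# Hypothetical weights: two small ones and one large one (the "outlier").
w = np.array([0.1, 1.0, 10.0])
lam = 0.5  # illustrative regularization strength, not a recommended value

# Each weight's contribution to the L2 term grows with its square.
penalty_per_weight = lam * w ** 2

# The gradient of lam * w**2 with respect to w is 2 * lam * w, so the
# "push toward zero" is proportional to the size of the weight itself.
grad_per_weight = 2 * lam * w

print("penalty per weight: ", penalty_per_weight)
print("gradient per weight:", grad_per_weight)
```

The weight that is 100 times larger than the smallest one contributes 10,000 times more to the penalty, so shrinking it is by far the cheapest way for the optimizer to reduce the L2 term.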
The cost/loss values are never negative, so there is no negative case to check for. The goal of training is then to minimize the cost, which for a non-negative value means making it as close to zero as possible.
The point of the regularization term is that we add that to the cost (only during training). There are a number of different forms of regularization, but the one here is L2 regularization which adds the sum of the squares of all the weights (scaled by a constant) as the regularization term. If we minimize the new cost (the original or “real” cost plus the L2 sum), then we have to work with both terms.
If we have an overfitting problem, one cause is that the model is putting too much emphasis on certain inputs, meaning their corresponding weight values are relatively large. So if we also need to minimize the L2 term, then that provides a “forcing function” that pushes all the individual weights to be small. We could easily get the L2 term to be zero by setting all the weights to zero, right? But then that’s going to make the real base cost term huge, because the model is just useless and makes the same prediction no matter what the input is.
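To make the two terms concrete, here is a minimal sketch. It assumes a plain linear model with a squared-error base cost and an L2 term scaled by `lam / (2 * m)`; the function name `l2_regularized_cost`, the toy data, and the scaling are my own choices for illustration, not necessarily what the course code uses.

```python
import numpy as np

def l2_regularized_cost(w, b, X, y, lam):
    """Return (base cost, L2 term, total) for a linear model.

    Assumption: squared-error base cost; the bias b is not penalized.
    """
    m = X.shape[0]
    predictions = X @ w + b
    base_cost = np.sum((predictions - y) ** 2) / (2 * m)  # the "real" cost
    l2_term = (lam / (2 * m)) * np.sum(w ** 2)            # the regularization term
    return base_cost, l2_term, base_cost + l2_term

# Toy data generated exactly from y = 2*x1 - 3*x2 (hypothetical).
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 3.0]])
y = X @ np.array([2.0, -3.0])

# Weights that fit the data exactly: the base cost is zero, the L2 term is not.
exact_fit = l2_regularized_cost(np.array([2.0, -3.0]), 0.0, X, y, lam=1.0)

# All-zero weights: the L2 term is zero, but the base cost is large.
all_zero = l2_regularized_cost(np.zeros(2), 0.0, X, y, lam=1.0)

print("exact fit (base, L2, total):", exact_fit)
print("all zeros (base, L2, total):", all_zero)
```

Neither extreme minimizes the total, which is the balancing act: the optimizer has to trade a little base cost for smaller weights.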
So it’s all about the balancing act between the real cost function and the L2 term. Of course there is also the $\lambda$ value that plays a key role in exactly how that balancing act plays out. If we make $\lambda$ too large, then the L2 term dominates and drives everything pretty close to zero. So we need to run some experiments to find a “Goldilocks” $\lambda$ value that gives us a good balance between variance and bias and damps down the overfitting.
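One way to see the effect of a growing lambda is to sweep it and watch the size of the learned weights. The sketch below swaps in linear regression without a bias term, because there the minimizer of the L2-regularized squared-error cost has a closed form, `w = (XᵀX + lam·I)⁻¹ Xᵀy`. The data, the helper name `ridge_weights`, and the list of lambda values are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + 0.1 * rng.normal(size=20)

def ridge_weights(X, y, lam):
    """Closed-form minimizer of squared error plus lam * sum(w**2), no bias term."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]
norms = [np.linalg.norm(ridge_weights(X, y, lam)) for lam in lambdas]

for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:>7}: ||w|| = {norm:.4f}")
```

The norm of the weights shrinks steadily as lambda grows, and at the largest values the weights are close to zero. This only shows the shrinking effect; picking the "Goldilocks" value still means comparing performance on held-out data, which this sketch does not do.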
The other high-level point here is that our goal is to affect the training so that it gives us a model with better overall results, meaning one that is not too specific to the training data (which is what “overfitting” means) and that generalizes well to new data.
Of course training is all driven by the gradients of the cost function: that is what controls how the parameters of the model (weight and bias values) get updated. In other words, the gradients tell us which direction to push the parameters to get better results.
By modifying the cost function with the regularization term, we are modifying the way the training can achieve minimal cost. We are saying that not only do we want you to make accurate predictions, but we also want you to do it in a way that keeps the individual weight values relatively small in magnitude.
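Here is a rough sketch of how that shows up in the gradient update itself. It again assumes a linear model with a squared-error cost and an L2 term scaled by `lam / (2 * m)`, so the extra gradient on each weight is `(lam / m) * w`. The helper names `gradient_step` and `train`, the data, and the hyperparameters are hypothetical.

```python
import numpy as np

def gradient_step(w, b, X, y, lam, lr):
    """One gradient descent step on squared-error cost plus an L2 penalty on w."""
    m = X.shape[0]
    error = X @ w + b - y
    dw = X.T @ error / m + (lam / m) * w  # base-cost gradient + L2-term gradient
    db = np.sum(error) / m                # the bias is not regularized in this sketch
    return w - lr * dw, b - lr * db

def train(X, y, lam, lr=0.1, steps=2000):
    """Run plain gradient descent from all-zero parameters."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        w, b = gradient_step(w, b, X, y, lam, lr)
    return w, b

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

w_plain, _ = train(X, y, lam=0.0)
w_reg, _ = train(X, y, lam=10.0)

print("no regularization:", w_plain, "norm:", np.linalg.norm(w_plain))
print("lam = 10:         ", w_reg, "norm:", np.linalg.norm(w_reg))
```

The only difference between the two runs is the `(lam / m) * w` term in `dw`, which pulls every weight toward zero on every step, with a pull proportional to the weight's current size.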