How does regularization work for logistic regression?

It was mentioned in the course, and demonstrated in the practice lab, that by adjusting the value of the regularization parameter we can control the amount of overfitting. How can we understand why a large regularization parameter shrinks the coefficients that correspond to the higher-order polynomial terms?


Specifically, how can we see this from the update step? Shouldn’t a large regularization parameter have the same effect as a large learning rate alpha, i.e. the coefficients bouncing between values instead of converging (imagine lambda is much bigger than the derivative term)?
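For concreteness, the update step I am referring to is, as I understand it from the lectures, roughly

$$
w_j := w_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}w_j\right]
= w_j\left(1 - \frac{\alpha\lambda}{m}\right) - \frac{\alpha}{m}\sum_{i=1}^{m}\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right)x_j^{(i)}
$$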


Regularization helps suppress the contribution of features (weights) that make the function fluctuate too much and add noise. The learning rate alpha, by contrast, controls the size of the step you take toward an optimum of the cost function.
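If it helps, here is a minimal sketch (using scikit-learn rather than the course lab code, on made-up data) that shows the weights on high-order polynomial features shrinking as the regularization strength grows. Note that in `LogisticRegression`, `C` is the inverse of the regularization strength, so a small `C` plays the role of a large lambda:

```python
# Minimal sketch (assumes scikit-learn): compare learned weights as the
# regularization strength increases. Small C ~ large lambda.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

for C in [100.0, 1.0, 0.01]:  # weak -> strong regularization
    model = make_pipeline(
        PolynomialFeatures(degree=6, include_bias=False),  # high-order features
        StandardScaler(),
        LogisticRegression(C=C, max_iter=5000),
    )
    model.fit(X, y)
    w = model.named_steps["logisticregression"].coef_.ravel()
    print(f"C={C:>6}: max |w| = {np.abs(w).max():.3f}, mean |w| = {np.abs(w).mean():.3f}")
```

You should see both the maximum and the average weight magnitude drop as the regularization gets stronger.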

I think this video does a pretty good job of explaining regularization:


The video describes the idea of using an additional term to shrink the coefficients of the high-order polynomial terms. I understand the idea, but I don’t see how it follows from the equation. I am asking how this is actually achieved at the iteration step.

I understand the learning rate. Just take a look at the equation of the iteration: the regularization term rescales the weight by (1 - alpha*lambda/m). If that term dominates, what will you have? Roughly w := -w * (very large number). What happens at the next iteration? The weight oscillates between very large negative and very large positive values, much like what happens when the learning rate is too large.
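Here is a toy numeric version of that (my own made-up numbers, keeping only the regularization part of the update and dropping the data-fit gradient):

```python
# Toy numbers, not from the lab: keep only the regularization part of the
# update, w := w * (1 - alpha*lambda/m), and make lambda huge.
alpha, lam, m = 0.1, 1000.0, 10
factor = 1 - alpha * lam / m   # = -9, a "very large" negative multiplier
w = 0.5
for step in range(5):
    w = factor * w
    print(f"step {step + 1}: w = {w:.1f}")
# step 1: w = -4.5
# step 2: w = 40.5
# ... the weight flips sign and grows instead of converging
```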


That’s just an easy example, which Andrew tends to use when explaining regularization.

In practice, all of the features have regularization applied with equal emphasis.
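To be concrete, the regularized cost for logistic regression in the course applies the same lambda to every weight (only b is left out of the penalty), roughly:

$$
J(\mathbf{w},b) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log f_{\mathbf{w},b}(\mathbf{x}^{(i)}) + \left(1-y^{(i)}\right)\log\left(1-f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}w_j^2
$$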

The weights will only oscillate if the learning rate is too large.