It was mentioned in the course, and demonstrated in the practice lab, that by adjusting the value of the regularization parameter we can control the amount of overfitting. How can we understand how a large regularization parameter minimizes the coefficients that correspond to the higher-order polynomial terms?
Specifically, how can we see this from the update step? Shouldn't a large regularization parameter have the same effect as a large learning rate alpha, i.e. the coefficients bouncing between values instead of converging (imagine lambda is much bigger than the derivative term)?
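For reference, here is the update step I have in mind (my sketch, assuming the regularized linear-regression update and notation from the course, with m training examples):

```latex
w_j := w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} w_j \right]
    = w_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
```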
Regularization helps suppress the contribution of features (their weights) that make the function fluctuate too much and add too much noise. The learning rate alpha, on the other hand, controls the size of the step you take toward an optimum.
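To make the distinction concrete, here is a minimal sketch of one gradient-descent step for regularized linear regression with a squared-error cost (my own illustration, not code from the lab; the function name and arguments are made up):

```python
import numpy as np

def gradient_step(w, b, X, y, alpha, lambda_):
    """One gradient-descent step for regularized linear regression (sketch).

    lambda_ only adds (lambda_ / m) * w to the gradient of each weight,
    pulling every weight toward zero; alpha scales the whole step.
    """
    m = X.shape[0]
    err = X @ w + b - y                          # prediction errors, shape (m,)
    dj_dw = (X.T @ err) / m + (lambda_ / m) * w  # data term + regularization term
    dj_db = err.mean()                           # the bias b is not regularized
    w_new = w - alpha * dj_dw                    # = w*(1 - alpha*lambda_/m) - alpha*(data term)
    b_new = b - alpha * dj_db
    return w_new, b_new
```

So lambda never scales the step as a whole; it only determines how strongly each weight is pulled toward zero, while alpha decides how big a step you take in that combined direction.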
This video here I think does a pretty good job explaining regularization:
This video describes the idea of using an additional term to minimize the coefficients of the high-order polynomial terms. I understand the idea, but I don't see how it follows from the equation. I am asking how this is actually achieved at the iteration step.
I understand the learning rate. Just take a look at the equation of the iteration. If the first term is dominating, what will you have? Roughly w := -w * (very large number). What happens at the next iteration? The weight will oscillate between very large negative and very large positive values, much like what happens if the learning rate is too large.
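To see exactly when that happens, here is a tiny numerical sketch (hypothetical numbers) of just the regularization part of the step, which multiplies w by (1 - alpha*lambda/m) each iteration; it keeps shrinking w toward zero while that factor stays between -1 and 1, and only blows up in the oscillating way described above once alpha*lambda/m exceeds 2:

```python
def iterate_regularization_only(w0, alpha, lambda_, m, steps=5):
    # Apply only the weight-decay part of the update, w <- w * (1 - alpha*lambda_/m),
    # ignoring the data-gradient term as in the argument above.
    w, history = w0, [w0]
    for _ in range(steps):
        w = w * (1 - alpha * lambda_ / m)
        history.append(w)
    return history

print(iterate_regularization_only(w0=10.0, alpha=0.01, lambda_=50.0, m=100))
# alpha*lambda/m = 0.005 -> w shrinks smoothly toward zero
print(iterate_regularization_only(w0=10.0, alpha=0.01, lambda_=30000.0, m=100))
# alpha*lambda/m = 3 -> w flips sign and grows each iteration
```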