I really can’t understand how, from this regularization formula, we can say that increasing lambda will reduce w?
Hi, @someone555777 !
When we increase lambda, we increase the overall value of the cost function. That means that when training tries to minimize the cost function, it will work hardest on reducing the values of w, because the regularization term now has the biggest impact on the cost.
First, understand why we need the regularization term. When our model overfits (it does a good job on the training data but a poor job on the test data), it means our model does not generalize well. So, we need to penalize it. Penalization means we need to increase the cost (J). So, when we add that extra regularization term, we are trying to increase the cost. However, at the same time, we are also trying to decrease the cost by optimization (gradient descent).
So, gradient descent will try to reduce the cost while the regularization term will try to increase it. That is a clash, right? The larger the lambda value, the larger the cost will be. But gradient descent is trying hard to minimize the cost, so it will reduce the values of the parameters (w).
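A quick way to see this numerically: below is a minimal sketch of the L2 penalty \frac{\lambda}{2m}\sum_j w_j^2 that gets added to the cost (the values of m, w, and \lambda are made up for illustration).

```python
import numpy as np

def l2_penalty(w, lam, m):
    # The regularization term added to the cost: (lambda / (2*m)) * sum(w_j^2)
    return (lam / (2 * m)) * np.sum(w ** 2)

m = 50                              # made-up number of training examples
w_small = np.array([0.5, -0.3])     # small-magnitude parameters
w_large = np.array([5.0, -3.0])     # large-magnitude parameters

for lam in [0.0, 1.0, 10.0]:
    print(f"lambda={lam:>4}: penalty(small w)={l2_penalty(w_small, lam, m):.4f}, "
          f"penalty(large w)={l2_penalty(w_large, lam, m):.4f}")
```

With \lambda = 10 the large-magnitude w adds 3.4 to the cost while the small-magnitude w adds only 0.034, so the bigger \lambda is, the more the optimizer gains by shrinking w.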
Best,
Saif.
So, I still can’t understand why the parameter w can only be reduced, and why it cannot be increased too. As I understand it, the cost function can increase in both directions: when w increases as well as when it decreases.
Hi @someone555777,
The gradient descent update formula is w := w - \alpha\times\frac{\lambda}{2m}\times(2w) - \cdots. Note that we are focusing on the regularization term, so I didn’t show the error term and replaced it with \cdots.
Ask yourself these questions, given that \alpha, \lambda, and m are always positive:
- if w is positive, will the regularization term make the updated w be larger in magnitude (sign not considered), or smaller?
- if w is negative, will the regularization term make the updated w be larger in magnitude (sign not considered), or smaller?
If you find both to be smaller in magnitude (sign not considered), then that explains the question in the title. If you find either of them to be larger in magnitude, tell me which one and explain it, and we can discuss your explanation.
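If you want to check your answers numerically, here is a minimal sketch that applies only the regularization part of the update above (the values of \alpha, \lambda, and m are made up):

```python
# Regularization part of the update only:
# w := w - alpha * (lambda / (2*m)) * (2*w)  =  w * (1 - alpha*lambda/m)
def reg_update(w, alpha, lam, m):
    return w - alpha * (lam / (2 * m)) * (2 * w)

alpha, lam, m = 0.01, 1.0, 50   # made-up, typical-scale values

for w in [0.8, -0.8]:
    w_new = reg_update(w, alpha, lam, m)
    print(f"w = {w:+.4f} -> {w_new:+.4f}  "
          f"(|w| {'decreased' if abs(w_new) < abs(w) else 'increased'})")
```

The factor (1 - \alpha\lambda/m) that multiplies w is the key quantity to look at.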
Raymond
So, do I understand correctly that it can reduce the influence (the absolute value of the number), but never change the sign?
I think that’s exactly backwards: the regularization term reduces the magnitude of the elements of w always, not never. That was Raymond’s point. Work out the examples that he suggests.
Well, if you get fully technical, you can probably construct an example in which the subtraction of the “update” term actually flips the sign of the corresponding element of w, but the point is that this is not what happens with typical values of \alpha and \lambda. If you use values of \alpha and \lambda large enough to flip the sign, then you’ll likely get divergence in any case, so that’s not really a relevant example for the point Raymond is making here.
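To make that concrete, here is a rough sketch (with made-up numbers) of what repeated applications of only the regularization part of the update, w \leftarrow w \times (1 - \alpha\lambda/m), do in a typical setting versus an extreme one:

```python
def run(w, factor, steps=5):
    # Repeatedly apply the regularization-only update w <- w * factor,
    # where factor = 1 - alpha*lambda/m.
    history = [round(w, 4)]
    for _ in range(steps):
        w *= factor
        history.append(round(w, 4))
    return history

# Typical: alpha*lambda/m is tiny, factor is just below 1 -> |w| slowly decays, sign is kept.
print(run(0.8, 1 - 0.0002))
# Extreme: alpha*lambda/m = 2.5, factor = -1.5 -> the sign flips every step and |w| blows up.
print(run(0.8, 1 - 2.5))
```

In the second case the update itself is unstable, which matches the point above: sign flips only show up in settings where training would diverge anyway.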
As Paul explained, everything which is not forbidden is allowed.
If you have an interesting special case in mind, feel free to share it.