Hi, can I just check how raising lambda penalises the W parameter mathematically? I have been going through the videos, and I get that raising lambda to an extremely large value basically makes it such that only the b parameter has an effect, essentially producing a straight horizontal line whose height depends on b's value. But I just can't understand how a large increase in lambda reduces W drastically.

Hello @zheng_xiang1,

If we focus on just the regularization term, if w decreases, would it make the cost smaller or larger?

Raymond

Smaller, if w were to be decreased.

Exactly! To optimize the cost, we would want the weights to shrink. Now, what is the gradient of the regularization term?
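To make the first point concrete, here is a small numerical sketch of the L2 regularization term \frac{\lambda}{2m}\sum_j{w_j^2} shrinking as the weights shrink (the values of lam, m, and the weight vectors are made up for illustration):

```python
# Sketch of the L2 regularization term (lambda / (2*m)) * sum of w_j^2.
# The values of lam, m, and the weight vectors below are made up.
def reg_term(w, lam, m):
    return (lam / (2 * m)) * sum(wj ** 2 for wj in w)

lam, m = 1.0, 10
print(reg_term([3.0, -2.0], lam, m))   # larger weights, larger penalty
print(reg_term([0.3, -0.2], lam, m))   # smaller weights, smaller penalty
```

Since every w_j enters the term squared, any move of a weight towards zero makes the term, and hence the cost, smaller.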

Raymond

The gradient should be \frac{\lambda}{m}w^2, I think.

Let's look at this slide, which includes the gradient of the regularization term:

If we look at the bottom left, we have the weight's update formula, and on the bottom right, we have the gradient terms.

We already know that shrinking the weights can reduce the cost, and you said in your first post that you wanted to see this mathematically, so here you go.

Again, let's just look at the regularization term. When w_j is positive:

- from the bottom right, is the regularization's gradient positive or negative?
- in the bottom left, would it drive w_j to increase or decrease? You need the answer from the first question.

Raymond

1) Negative, I think, but I'm not too sure why.

2) Decrease, if 1 is negative.

Oh, so is it because the derivative of the regularization term is negative, and if it gets larger it reduces the derivative of the cost function?

Let me re-ask my question.

- What is the gradient of the regularization term?
- If w_j is positive, is that term positive or negative? We need to be very careful about the signs.
- in the bottom left, would it drive w_j to increase or decrease? We need to be very careful about the signs.

The answer to question 1 is \frac{\lambda}{m}w_j. Can you read this from the slide? We need to be careful when reading it. If you are not familiar with differentiation, I suggest you refer to other course materials about gradient descent without regularization, and compare them to find the additional term due to regularization. It will take some time, but going through that exercise should be helpful.

Take your time.

Here are my answers:

- \frac{\lambda}{m}{w_j}
- positive, because each symbol in the term is positive.
- decrease, because each symbol in -\alpha\frac{\partial{J}}{\partial{w_j}} is positive, except for the minus sign.

What would be the answers if, instead, w_j is negative?

- \frac{\lambda}{m}{w_j}
- negative. Try to verify it.
- increase. Try to verify it.
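A quick way to verify both cases is to code up just the regularization part of the update, w_j := w_j - \alpha\frac{\lambda}{m}w_j, and try a positive and a negative weight (the default values of lam, m, and alpha below are made up):

```python
# Sketch: sign of the regularization gradient (lam/m) * w_j, and the
# direction of the update step w_j := w_j - alpha * (lam/m) * w_j.
# The default values of lam, m, and alpha are made up for illustration.
def reg_update(w_j, lam=1.0, m=10, alpha=0.1):
    grad = (lam / m) * w_j        # gradient of (lam/(2m)) * w_j**2; same sign as w_j
    return w_j - alpha * grad     # the update moves w_j towards zero

print(reg_update(2.0))    # positive w_j: positive gradient, w_j decreases
print(reg_update(-2.0))   # negative w_j: negative gradient, w_j increases
```

In both cases the magnitude of w_j drops, matching the sign analysis above.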

The conclusion is:

if w_j is positive, the regularization term tends to decrease w_j;

if w_j is negative, the regularization term tends to increase w_j.

Both tend to push w_j towards zero. Therefore, it shrinks the weights.

You asked how "lambda" works in the title of this thread. The answer is: the larger the \lambda, the stronger the "pushing" force.
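You can see this "pushing" force by repeatedly applying only the regularization part of the update for two different values of \lambda (alpha, m, the starting w_j, and the step count below are made up for illustration):

```python
# Sketch: iterating only the regularization part of the update,
# w_j := w_j - alpha * (lam/m) * w_j. A larger lam shrinks w_j faster.
# alpha, m, the starting w_j, and the step count are made up.
def shrink(w_j, lam, m=10, alpha=0.1, steps=50):
    for _ in range(steps):
        w_j -= alpha * (lam / m) * w_j
    return w_j

print(shrink(2.0, lam=1.0))    # weak push towards zero
print(shrink(2.0, lam=10.0))   # much stronger push towards zero
```

Each step multiplies w_j by (1 - \alpha\frac{\lambda}{m}), so a larger \lambda drives the weight towards zero in fewer iterations.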

Cheers,

Raymond

THANKS A BUNCH, will revisit it soon!!