Please explain how lambda in L2 regularization affects the weights in both forward prop and backward prop?
Here are the steps for training:

1. In the forward pass, invoke `model.predict` on a batch of data.
2. Calculate the loss.
3. Since we're using L2 regularization, add an additional regularization term, i.e. \frac{\lambda \sum_i w_i^2}{batch\_size}, to compute the overall loss.
4. You don't have to worry about the backward pass, since almost all modern frameworks like TensorFlow and PyTorch track the details needed for it. That said, to do it manually, for each weight w_i you would add a \frac{2 \lambda w_i}{batch\_size} term when calculating the gradient of the loss with respect to that weight (see the sketch after this list).
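Here is a minimal NumPy sketch of those four steps for a single linear unit with a mean-squared-error data loss; the model, data, and value of `lmbda` are made up purely for illustration. It adds the L2 penalty to the loss in the forward pass and the corresponding 2 \lambda w_i / batch\_size term to the gradient in the backward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))        # one batch of data (batch_size = 32)
y = rng.normal(size=(32,))          # targets
w = rng.normal(size=(4,))           # weights of a single linear unit
b = 0.0
lmbda = 0.1                         # regularization strength (illustrative)
batch_size = X.shape[0]

# Forward pass: predictions and the data loss (mean squared error here).
y_hat = X @ w + b
data_loss = np.mean((y_hat - y) ** 2)

# L2 penalty added to the loss: lambda * sum(w**2) / batch_size.
l2_penalty = lmbda * np.sum(w ** 2) / batch_size
total_loss = data_loss + l2_penalty

# Backward pass by hand: gradient of the data loss plus the gradient of
# the penalty, 2 * lambda * w / batch_size, for each weight.
grad_data = 2 * X.T @ (y_hat - y) / batch_size
grad_w = grad_data + 2 * lmbda * w / batch_size
```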
See this for the penalty calculation of L2 regularization.
How does increasing the value of lambda decrease the value of the weights?
In forward prop no calculation of the weights is required, but in backward prop increasing lambda in L2 leads to a reduction in the value of the weights? How?
When you increase the value of lambda, the additional term in the backward pass (see point 4 from the previous reply) reduces the weight by a larger amount.
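As a rough illustration (all the numbers below are made up), this sketch applies one gradient descent step to a single weight with different values of lambda; the larger lambda is, the larger the extra gradient term from point 4 and the more the weight shrinks.

```python
w = 0.8            # a single weight before the update
grad_data = 0.05   # made-up gradient of the unregularized loss w.r.t. w
lr = 0.1           # learning rate
batch_size = 32

for lmbda in [0.0, 1.0, 10.0]:
    # L2 adds 2 * lambda * w / batch_size to the gradient (point 4 above).
    grad = grad_data + 2 * lmbda * w / batch_size
    w_new = w - lr * grad
    print(f"lambda={lmbda:5.1f}  gradient={grad:.4f}  updated w={w_new:.4f}")
```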
Does this help?
Yes, in MLS Course 2 Week 3.
I need to go through the concepts again. I will get back if I have any doubts.
The point is when you use L2 regularization, you are adding a new term to your loss function. It is now the original loss plus the L2 term and your goal is to minimize the sum of those two terms. Well, there’s an obvious way to minimize the L2 term, right? Just set all the W values to zero and that’ll do the trick, regardless of the value of \lambda. But that will give you a big loss in the first term (the pre-existing loss function).
So what happens in back prop is a balancing between the two loss terms. How dominant the L2 term is depends on how large the value of \lambda is, right? The larger you make \lambda, the more that biases the loss in favor of small absolute values for the weights. If you set \lambda = 0 or a very small value, then the L2 term has almost no effect.
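A tiny one-dimensional example (purely hypothetical numbers) makes that balance concrete: suppose the original data loss were (w - 3)^2, minimized at w = 3, while the L2 term \lambda w^2 is minimized at w = 0. Setting the derivative of their sum to zero gives the optimum w^* = 3 / (1 + \lambda), which slides from 3 toward 0 as \lambda grows.

```python
# One-dimensional illustration: data loss (w - 3)^2 vs. L2 term lambda * w^2.
# Setting the derivative of their sum to zero gives w* = 3 / (1 + lambda).
for lmbda in [0.0, 0.1, 1.0, 10.0]:
    w_star = 3 / (1 + lmbda)
    total = (w_star - 3) ** 2 + lmbda * w_star ** 2
    print(f"lambda={lmbda:5.1f}  best w={w_star:.3f}  total loss={total:.3f}")
```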
The goal is to find a good value for \lambda that reduces the overfitting you were originally having while still giving you high accuracy on the validation and test data. That requires some tuning of course to find a good value.
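One simple way to do that tuning, sketched below with TensorFlow/Keras on a throwaway synthetic dataset (the layer sizes, epoch count, and candidate \lambda values are only placeholders), is to train the same model with a few values of \lambda and keep the one that gives the best validation accuracy.

```python
import numpy as np
import tensorflow as tf

# Hypothetical synthetic data standing in for a real train/validation split.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 10)), rng.integers(0, 2, size=(200,))
X_val, y_val = rng.normal(size=(50, 10)), rng.integers(0, 2, size=(50,))

results = {}
for lmbda in [0.0, 0.001, 0.01, 0.1]:
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu",
                              kernel_regularizer=tf.keras.regularizers.L2(lmbda)),
        tf.keras.layers.Dense(1, activation="sigmoid",
                              kernel_regularizer=tf.keras.regularizers.L2(lmbda)),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.fit(X_train, y_train, epochs=20, verbose=0)
    _, val_acc = model.evaluate(X_val, y_val, verbose=0)
    results[lmbda] = val_acc

# Pick the lambda with the best validation accuracy.
print(results)
```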