Please explain how lambda in L2 regularization affects the weights in both forward prop and backward prop?
Here are the steps for training:

1. In the forward pass, invoke `model.predict` on a batch of data.
2. Calculate the loss.
3. Since we're using L2 regularization, add an additional regularization term, i.e. \frac{\lambda \sum_i w_i^2}{batch\_size}, to compute the overall loss.
4. You don't have to worry about the backward pass, since almost all modern frameworks like TensorFlow and PyTorch track the details needed for it. That said, to do it manually, for each weight w_i you would add a \frac{2 \lambda w_i}{batch\_size} term when calculating the gradient of the loss with respect to that weight (see the sketch after this list).
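Here is a minimal NumPy sketch of those four steps for a single linear unit with a mean-squared-error data loss; the model, data, and value of `lmbda` are made up purely for illustration. It adds the L2 penalty to the loss in the forward pass and the corresponding 2 \lambda w_i / batch\_size term to the gradient in the backward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))        # one batch of data (batch_size = 32)
y = rng.normal(size=(32,))          # targets
w = rng.normal(size=(4,))           # weights of a single linear unit
b = 0.0
lmbda = 0.1                         # regularization strength (illustrative)
batch_size = X.shape[0]

# Forward pass: predictions and the data loss (mean squared error here).
y_hat = X @ w + b
data_loss = np.mean((y_hat - y) ** 2)

# L2 penalty added to the loss: lambda * sum(w**2) / batch_size.
l2_penalty = lmbda * np.sum(w ** 2) / batch_size
total_loss = data_loss + l2_penalty

# Backward pass by hand: gradient of the data loss plus the gradient of
# the penalty, 2 * lambda * w / batch_size, for each weight.
grad_data = 2 * X.T @ (y_hat - y) / batch_size
grad_w = grad_data + 2 * lmbda * w / batch_size
```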
See this for the penalty calculation of L2 regularization.
How does increasing the value of lambda decrease the value of the weights?
In forward prop no calculation of the weights is required, but in backward prop increasing lambda in L2 leads to a reduction in the value of the weights? How?
When you increase the value of lambda, the additional term in the backward pass (see point 4 from the previous reply) reduces the weight by a larger amount.
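As a rough illustration (all the numbers below are made up), this sketch applies one gradient descent step to a single weight with different values of lambda; the larger lambda is, the larger the extra gradient term from point 4 and the more the weight shrinks.

```python
w = 0.8            # a single weight before the update
grad_data = 0.05   # made-up gradient of the unregularized loss w.r.t. w
lr = 0.1           # learning rate
batch_size = 32

for lmbda in [0.0, 1.0, 10.0]:
    # L2 adds 2 * lambda * w / batch_size to the gradient (point 4 above).
    grad = grad_data + 2 * lmbda * w / batch_size
    w_new = w - lr * grad
    print(f"lambda={lmbda:5.1f}  gradient={grad:.4f}  updated w={w_new:.4f}")
```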
Does this help?
Yes, in MLS Course 2 Week 3.
I need to go through the concepts again. I will get back if I have any doubts.
The point is when you use L2 regularization, you are adding a new term to your loss function. It is now the original loss plus the L2 term and your goal is to minimize the sum of those two terms. Well, there’s an obvious way to minimize the L2 term, right? Just set all the W values to zero and that’ll do the trick, regardless of the value of \lambda. But that will give you a big loss in the first term (the pre-existing loss function).
So what happens in back prop is a balancing between the two loss terms. How dominant the L2 term is depends on how large the value of \lambda is, right? The larger you make \lambda, the more that biases the loss in favor of small absolute values for the weights. If you set \lambda = 0 or a very small value, then the L2 term has almost no effect.
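A tiny one-dimensional example (purely hypothetical numbers) makes that balance concrete: suppose the original data loss were (w - 3)^2, minimized at w = 3, while the L2 term \lambda w^2 is minimized at w = 0. Setting the derivative of their sum to zero gives the optimum w^* = 3 / (1 + \lambda), which slides from 3 toward 0 as \lambda grows.

```python
# One-dimensional illustration: data loss (w - 3)^2 vs. L2 term lambda * w^2.
# Setting the derivative of their sum to zero gives w* = 3 / (1 + lambda).
for lmbda in [0.0, 0.1, 1.0, 10.0]:
    w_star = 3 / (1 + lmbda)
    total = (w_star - 3) ** 2 + lmbda * w_star ** 2
    print(f"lambda={lmbda:5.1f}  best w={w_star:.3f}  total loss={total:.3f}")
```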
The goal is to find a good value for \lambda that reduces the overfitting you were originally having while still giving you high accuracy on the validation and test data. That requires some tuning of course to find a good value.
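One simple way to do that tuning, sketched below with TensorFlow/Keras on a throwaway synthetic dataset (the layer sizes, epoch count, and candidate \lambda values are only placeholders), is to train the same model with a few values of \lambda and keep the one that gives the best validation accuracy.

```python
import numpy as np
import tensorflow as tf

# Hypothetical synthetic data standing in for a real train/validation split.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 10)), rng.integers(0, 2, size=(200,))
X_val, y_val = rng.normal(size=(50, 10)), rng.integers(0, 2, size=(50,))

results = {}
for lmbda in [0.0, 0.001, 0.01, 0.1]:
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu",
                              kernel_regularizer=tf.keras.regularizers.L2(lmbda)),
        tf.keras.layers.Dense(1, activation="sigmoid",
                              kernel_regularizer=tf.keras.regularizers.L2(lmbda)),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.fit(X_train, y_train, epochs=20, verbose=0)
    _, val_acc = model.evaluate(X_val, y_val, verbose=0)
    results[lmbda] = val_acc

# Pick the lambda with the best validation accuracy.
print(results)
```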