Actually, I still don't quite get why \frac{\lambda}{m} w_j helps regularize w_j. Can anyone give a more straightforward derivation to show how it impacts the polynomials?
The key is how \lambda is used in the cost equation: it multiplies the sum of the squares of the weights.
That extra term increases the cost, so it creates an incentive for the optimizer to learn smaller weight values.
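For reference, here is the full regularized cost being minimized, written for linear regression (logistic regression just swaps in a different loss term):

J(\vec{w},b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2

Large weights inflate the second term, so minimizing J forces a trade-off between fitting the data and keeping the weights small.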
The equation you quoted is the gradient (the partial derivative of the cost with respect to each weight). Gradient descent uses those gradients to find the weights that minimize the cost.
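To make that concrete, substitute the regularized gradient into the gradient descent update (assuming the usual update rule with learning rate \alpha):

w_j := w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} w_j \right]

and regroup the w_j terms:

w_j := w_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}

Since \alpha \frac{\lambda}{m} is a small positive number, every iteration first multiplies w_j by a factor slightly less than 1 and only then applies the usual data-driven step. For a polynomial model, that means the coefficients on the higher-order terms keep getting shrunk toward zero unless the data actively supports them, which is exactly the regularizing effect you asked about.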
If the question is why they scale the L2 regularization term by \frac{\lambda}{m} rather than using plain \lambda, I think you could have done it either way; it's just a constant, after all. I have never seen an "official" explanation of this, but my theory is that you want the choice of the \lambda hyperparameter to be orthogonal to the size of the training set.

Note that when you train, you may have several differently sized training sets, e.g. a smaller subset that you use early on to speed up experimentation while you're playing with hyperparameters, and then the full training set once you feel your hyperparameter choices are close. It would be awkward if you had to retune \lambda for each dataset size.

The other intuition here is that the purpose of L2 regularization is to eliminate or mitigate overfitting, and the other strategy for reducing overfitting is to get more training data. With the factor of \frac{1}{m}, in the limit as m \rightarrow \infty the L2 term goes to zero. So if you have the ability to add more data (which is not always practical), you wouldn't also have to keep fiddling with the \lambda value, at least in theory.
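Here's a quick numerical sketch of that scaling intuition (just illustrative NumPy with made-up numbers, not code from the course): with the \frac{\lambda}{2m} convention, the data term stays about the same size as m grows, while the penalty term shrinks like \frac{1}{m}.

```python
import numpy as np

rng = np.random.default_rng(0)

def regularized_cost_and_grad(X, y, w, b, lam):
    """Regularized linear-regression cost pieces and weight gradient.

    Uses the lambda/(2m) convention discussed above. Variable names
    here are my own, purely for illustration.
    """
    m = X.shape[0]
    err = X @ w + b - y                      # prediction errors, shape (m,)
    data_cost = (err @ err) / (2 * m)        # mean squared error term
    penalty = lam / (2 * m) * (w @ w)        # L2 penalty, scaled by 1/m
    grad_w = X.T @ err / m + (lam / m) * w   # data gradient + decay term
    return data_cost, penalty, grad_w

# Two dataset sizes drawn from the same underlying distribution.
w_true = np.array([3.0, -2.0])
w, b, lam = np.array([1.0, 1.0]), 0.0, 10.0

for m in (100, 100_000):
    X = rng.normal(size=(m, 2))
    y = X @ w_true + rng.normal(scale=0.5, size=m)
    data_cost, penalty, _ = regularized_cost_and_grad(X, y, w, b, lam)
    print(f"m={m:>6}: data term ~ {data_cost:.3f}, penalty term = {penalty:.5f}")

# The data term stays roughly the same size, while the penalty term
# shrinks like 1/m -- so with a fixed lambda, regularization fades as
# you add more data, matching the intuition above.
```

With a fixed \lambda = 10 here, the penalty drops from about 0.1 at m = 100 to about 10^{-4} at m = 100{,}000, while the data term barely moves.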
But don’t forget my disclaimer from earlier: this is just my theory and I don’t have any external evidence to support it.