Questions on regularization

Divyaman_Singh_Rawat · July 13, 2023, 10:51am

Hello,

It would be great if I could get clarification on the following questions:

1. The cost is computed as follows:
   J(w, b) = (1/m) * (sum of losses across all training examples) + (lambda/(2m)) * (squared norm of W).
   Why do we have 'm' in the denominator of the regularization term? Does this not mean that the value of the regularization term decreases as the training set grows? I am not sure why the regularization term should be a function of the training sample size.
2. If regularization is not done, it is possible that one of the parameters ends up with a very large value (say 1000) and another ends up with a small value (say 5). However, we apply the same learning rate (say 0.01) to all the parameters during gradient descent, so gradient descent would take a long time to converge. I understand that regularization helps with this because it reduces the magnitude of all parameters to a similar scale. Is this understanding correct?
3. For regularization, lambda would have to be set to a large value. However, do we need to ensure that (learning_rate * lambda)/m is always a small positive value (close to 0)? My guess is that, although in theory lambda could be made so large that this expression evaluates to a high positive value, in practice the learning_rate is very small and the number of training examples is very high, so the expression is likely to stay close to zero (unless we recklessly set lambda to a very high number).

Thank you for your time!

Regards,
Divyaman

rmwkwok · July 13, 2023, 12:06pm

Hello @Divyaman_Singh_Rawat,

On your first question: the m serves as an adjustment factor, so that the more samples there are, the less regularization is applied. This is generally reasonable because increasing the sample size is itself a way to counter overfitting. You might get a better feel for this by training models on subsets of a dataset that have different sizes, with different values of lambda. You can also fool the adjustment factor m: increase the sample size by making many copies of one existing sample. If it is that easy to fool, you know it is not robust, although it is not fair to ask for robustness from such a simple adjustment.
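
The copy-a-sample trick can be checked numerically. The following is a minimal sketch, assuming a one-feature linear model with squared-error loss; the function name `cost_terms` and the toy data are made up for illustration and do not come from the course material.

```python
def cost_terms(w, b, xs, ys, lam):
    """Return (data_term, reg_term) of J = (1/m) * sum(losses) + (lam / (2m)) * w**2.

    Assumes a 1-D linear model w * x + b with squared-error loss (illustrative only).
    """
    m = len(xs)
    data_term = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / m
    reg_term = lam / (2 * m) * w ** 2
    return data_term, reg_term


# Hypothetical toy data.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.1, 5.9]
w, b, lam = 2.0, 0.0, 1.0

data_small, reg_small = cost_terms(w, b, xs, ys, lam)

# "Grow" the dataset by copying it 10 times: no new information is added.
copies = 10
data_big, reg_big = cost_terms(w, b, xs * copies, ys * copies, lam)

print("data term:", data_small, "->", data_big)  # essentially unchanged
print("reg term: ", reg_small, "->", reg_big)    # about 10x smaller
```

The averaged loss is the same for both datasets, but the regularization term is divided by the larger m, so the copied dataset is regularized less even though it contains nothing new.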

On your second question: that understanding is not necessarily correct. Even though the learning rate is the same, the gradient $\frac{\partial J}{\partial w}$ can be different for each parameter. Regularization tends to shrink the values of the weights, but I have never seen any proof that it scales them to a similar range.
On your third question: the learning rate and lambda are always positive, and you need to tune both yourself. If you are looking for a rule or typical values, I would instead recommend tuning them on your own dataset.
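
To see where (learning_rate * lambda)/m enters, it can help to write out one gradient descent step. With the cost from the question, the regularization term contributes (lambda/m) * w to the gradient, so the update can be rearranged as w := (1 - learning_rate * lambda / m) * w - learning_rate * (gradient of the data term). Below is a minimal sketch of that rearrangement for a single scalar weight; the function names and the example values are hypothetical.

```python
def decay_factor(learning_rate, lam, m):
    """Multiplier applied to w by the L2 term in one gradient descent step."""
    return 1.0 - learning_rate * lam / m


def regularized_step(w, data_grad, learning_rate, lam, m):
    """One gradient descent step on J = data term + (lam / (2m)) * w**2."""
    return w - learning_rate * (data_grad + (lam / m) * w)


# Hypothetical values: one weight, its data-term gradient, and hyperparameters.
w, data_grad = 5.0, 0.3
learning_rate, m = 0.01, 1000

for lam in (0.7, 1e6):
    factor = decay_factor(learning_rate, lam, m)
    new_w = regularized_step(w, data_grad, learning_rate, lam, m)
    # Both forms of the update agree up to floating-point rounding.
    assert abs(new_w - (factor * w - learning_rate * data_grad)) < 1e-9
    print(f"lambda={lam}: decay factor={factor}, w: {w} -> {new_w}")
```

With moderate values the factor sits just below 1, so each step shrinks w slightly. If learning_rate * lambda / m exceeds 1, the factor turns negative and the shrinkage part of the step flips the sign of w; beyond 2, its magnitude exceeds 1 and that part of the update grows w instead of shrinking it.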

Raymond

Divyaman_Singh_Rawat · July 17, 2023, 9:53am

Thank you for your answers!
I will, perhaps, need to try some of the things you recommended to get a better understanding of things.
