Questions on regularization


It would be great if I could get clarification on the following questions:

  1. The cost is computed as follows:
    J(w, b) = (1/m) * (sum of losses across all training examples) + (lambda/(2m)) * (squared norm of w).
    Why do we have ‘m’ in the denominator of the regularization term?
    Doesn't this mean that the regularization term shrinks as the training set grows? I am not sure why the regularization term should be a function of the training sample size.

  2. If regularization is not done, it is possible that one of the parameters ends up with a very large value (say 1000) and another ends up with a small value (say 5). However, we apply the same learning rate (say 0.01) to all the parameters during gradient descent. Consequently, gradient descent would take a long time to converge. However, I understand that regularization will help address this as it reduces the magnitude of all parameters to a similar scale. Is this understanding correct?

  3. For regularization, lambda would have to be set to a large value. However, do we need to ensure that (learning_rate * lambda)/m is always a small positive value (close to 0)?
    My guess is that, in theory, lambda could be made so large that the above expression evaluates to a large positive value. In practice, the learning_rate is very small and the number of training examples is very large, so this is unlikely to happen and the expression will most likely stay close to zero (unless we recklessly set lambda to a very high number).
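
For reference, the expression in question 3 shows up when the gradient descent update is written out for the regularized cost from question 1. This is only a sketch: J_0 (the unregularized average loss) and \alpha (the learning rate) are names introduced here for illustration, not notation from the thread.

```latex
% Assumption: J(w,b) = J_0(w,b) + \frac{\lambda}{2m}\lVert w \rVert^2,
% where J_0 is the unregularized average loss and \alpha is the learning rate.
\begin{aligned}
w &:= w - \alpha\left(\frac{\partial J_0}{\partial w} + \frac{\lambda}{m}\,w\right) \\
  &= \left(1 - \frac{\alpha\lambda}{m}\right) w - \alpha\,\frac{\partial J_0}{\partial w}
\end{aligned}
```

Under these assumptions, each step first multiplies w by (1 - \alpha\lambda/m). The weights are shrunk toward zero as long as 0 < \alpha\lambda/m < 1; if the product exceeded 1, the factor would become negative and flip the sign of w at every step.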

Thank you for your time!


Hello @Divyaman_Singh_Rawat,

  1. The m acts as an adjustment factor: the more samples there are, the less regularization is needed. This is generally true because increasing the sample size is itself a way to counter overfitting. You might get a better feel for this by training models on subsets of a dataset with different sizes, using different values of lambda. You can also fool the adjustment factor m: increase the sample size by making many copies of one existing sample. Since it can be fooled this easily, you know it is not robust, although it is not fair to demand robustness from such a simple adjustment.
  2. That is not necessarily true. Even though the learning rate is the same for every parameter, the gradient \frac{\partial{J}}{\partial{w}} can differ from one parameter to another. Regularization tends to shrink the values of the weights, but I have never seen any proof that it scales them to a similar range.
  3. The learning rate and lambda are always positive. You need to tune both of them yourself. If you are looking for a rule or typical values, I would recommend tuning them on your own dataset.
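
The "fooling m" idea from the first answer can be checked numerically. Here is a minimal sketch, assuming a linear model with a squared-error loss (the thread does not say which loss is used); `regularized_cost` and the synthetic data are made up for illustration.

```python
import numpy as np


def regularized_cost(w, b, X, y, lam):
    """Return (data_term, reg_term) of an L2-regularized squared-error cost.

    Assumed form: (1/(2m)) * sum of squared errors + (lam/(2m)) * ||w||^2.
    """
    m = X.shape[0]
    residuals = X @ w + b - y
    data_term = np.sum(residuals ** 2) / (2 * m)
    reg_term = lam / (2 * m) * np.sum(w ** 2)
    return data_term, reg_term


# Synthetic data, purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Some fixed parameters at which to evaluate the cost.
w = np.array([0.8, -1.5, 0.3])
b = 0.0
lam = 1.0

data_1, reg_1 = regularized_cost(w, b, X, y, lam)

# "Fool" m: stack 10 identical copies of the dataset. No new information
# is added, but m becomes 10 times larger.
X_dup = np.tile(X, (10, 1))
y_dup = np.tile(y, 10)
data_10, reg_10 = regularized_cost(w, b, X_dup, y_dup, lam)

print("data term, original vs duplicated:", data_1, data_10)
print("reg term,  original vs duplicated:", reg_1, reg_10)
```

The data term is an average, so duplicating the dataset leaves it unchanged, while the regularization term is divided by 10. That is the sense in which the 1/m adjustment is simple rather than robust.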



Thank you for your answers!
I will, perhaps, need to try some of the things you recommended to get a better understanding of things.