Hello,
It would be great if I could get clarification on the following questions:
Cost is computed as follows:
J(w, b) = (1/m) * (sum of losses across all training examples) + (lambda / (2m)) * ||w||^2.
The question is: why do we have ‘m’ in the denominator of the regularization term? Does this not mean that the value of the regularization term would decrease as the size of the training set increases? I am not sure why the regularization term should be a function of the training set size.
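To make my reading of the formula concrete, here is a rough numpy sketch of how I understand the cost is computed (the function and variable names are my own, and I am assuming a logistic-regression-style cross-entropy loss):

```python
import numpy as np

def regularized_cost(w, b, X, Y, lambd):
    """Average cross-entropy loss plus the L2 penalty, as I read the formula.
    X has shape (n_features, m); Y and the activations A have shape (1, m)."""
    m = X.shape[1]
    A = 1 / (1 + np.exp(-(np.dot(w.T, X) + b)))              # sigmoid activations
    cross_entropy = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    l2_penalty = (lambd / (2 * m)) * np.sum(np.square(w))    # the term with m in the denominator
    return cross_entropy + l2_penalty
```

With w and lambda held fixed, l2_penalty shrinks as m grows, which is what prompted the question.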
If regularization is not done, it is possible that one parameter ends up with a very large value (say 1000) while another ends up with a small value (say 5). Since we apply the same learning rate (say 0.01) to all the parameters during gradient descent, gradient descent would then take a long time to converge. My understanding is that regularization helps address this because it reduces the magnitudes of all the parameters to a similar scale. Is this understanding correct?
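For reference, this is how I picture the update step for w under gradient descent with L2 regularization (again only a sketch with names I made up; alpha is the learning rate):

```python
def update_weights(w, dw, alpha, lambd, m):
    # One gradient descent step on the regularized cost; the same learning
    # rate alpha is applied to every entry of w.
    grad = dw + (lambd / m) * w      # gradient of the cost including the L2 term
    return w - alpha * grad          # equivalently: (1 - alpha * lambd / m) * w - alpha * dw
```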
For regularization to have an effect, lambda would have to be set to a large value. However, do we need to ensure that (learning_rate * lambda) / m always remains a small positive value (close to 0)? My guess is that, although in theory lambda could be made so large that this expression evaluates to a large positive value, in practice the learning rate is very small and the number of training examples is very large, so it is unlikely that this would ever happen; the expression is far more likely to stay close to zero (unless we recklessly set lambda to a very high number).
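As a sanity check on this guess, plugging in some made-up but (I hope) plausible numbers:

```python
alpha, lambd, m = 0.01, 10, 50000   # learning rate, lambda, training set size (values I made up)
print(alpha * lambd / m)            # 2e-06, so the factor (1 - alpha * lambd / m) stays very close to 1
```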
Thank you for your time!
Regards,
Divyaman