When we calculate the cost, we take the average by multiplying the sum of the losses over the examples by \frac{1}{m}. This makes sense, since it keeps the magnitude of the cost comparable to the loss on a single example. What I saw Andrew do here is also normalize the regularizer, dividing the regularization coefficient \lambda by m (or 2m, to be precise). For the regularizer, however, this does not make sense to me, because the magnitude of the parameters does not depend on the number of examples. Can somebody enlighten me on why this is done?

It puts the effect of the regularizer on a different scale from the cost and takes away any "absolute" meaning of the value of \lambda. I know the effect can be compensated for by increasing \lambda m-fold, but then again, I don't see why we would want to do that and give up the interpretability of \lambda. (I also know that as you get more data points you may want less regularization, so it can make some sense that way, but why not leave that to hyperparameter tuning and retain the meaning of the "absolute" value of \lambda?)
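To make the two conventions concrete, here is a minimal NumPy sketch of what I mean; the function and variable names are my own, not from the course:

```python
import numpy as np

def regularized_cost(losses, w, lam, normalize_reg=True):
    """L2-regularized cost over m examples (names are mine, not from the course).

    losses: per-example losses, shape (m,)
    w:      weight vector (bias excluded, as usual)
    lam:    regularization coefficient lambda
    normalize_reg: True  -> course convention, penalty = lam/(2m) * ||w||^2
                   False -> penalty = lam/2 * ||w||^2, independent of m
    """
    m = losses.shape[0]
    data_term = losses.sum() / m                    # average loss: scale of a single example
    if normalize_reg:
        penalty = lam / (2 * m) * np.sum(w ** 2)    # shrinks as m grows, for a fixed lam
    else:
        penalty = lam / 2 * np.sum(w ** 2)          # same "absolute" meaning for any m
    return data_term + penalty

# Doubling m halves the penalty's relative weight under the first convention only.
rng = np.random.default_rng(0)
losses, w = rng.random(1000), rng.normal(size=20)
print(regularized_cost(losses, w, lam=0.1, normalize_reg=True))
print(regularized_cost(losses, w, lam=0.1, normalize_reg=False))
```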
Thank you Tom. I completely sympathize with that reason. But then my question becomes: why not be conscious of that and deliberately set lower values of \lambda when doing hyperparameter tuning? We have an intuition from statistics, for normally distributed random variables, that the standard deviation of a sample mean shrinks like \frac{1}{\sqrt{m}}. I'm not saying this translates directly to the math here, but, for example, how about not dividing \lambda by m and instead centering the range of values you experiment with around some value \frac{c}{\sqrt{m}}?
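Just to illustrate what I have in mind (a sketch with arbitrary constants, not a recommendation):

```python
import numpy as np

def lambda_candidates(m, c=1.0, decades=2, num=7):
    """Log-spaced candidates for an *unnormalized* lambda, centered at c / sqrt(m).
    c, decades, and num are arbitrary illustrative choices, not a tuning recipe.
    """
    center = c / np.sqrt(m)
    return np.logspace(np.log10(center) - decades, np.log10(center) + decades, num=num)

print(lambda_candidates(m=1000))   # centered near 0.0316 for m = 1000
```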
There are a lot of ways it could work. Many would work equally well.
The machine learning industry evolved over a relatively short period of time, with a lot of people working independently. There really isn't a lot of consensus on specific practices. Statistics has been around for centuries and has had time to mature.
If you have a strong statistics background, it's best to set that aside for a while, and you'll see that ML accomplishes many of the same things but using methods that evolved separately.
Thank you again for your answer, Tom. I actually come from a non-DL ML (statistical ML) background (I was living under a rock for a very long time, and while I knew the ideas of DL, I never bothered to learn more about it, which is why I'm taking this course), and I usually saw the regularization coefficient not being "normalized". As for setting the stats view aside, I'm aligned with you. I actually teach ML, and some of my biggest problems are with people who do have a stats background: they want to hit everything with their stats hammer, see every problem as a nail for that hammer, and fail to appreciate the difference in perspective between us. I would say we use the same methods they do but view the problem in a completely different light: we are mainly concerned with performance, they are mainly concerned with explanation. With DL, and especially very big models, you can say on paper that this is a statistical method that kind of averages the data, but really the model is so huge that it can fit everything, and the averaging part of what the network does is not that prominent. So DL is even farther from a statistician's background than non-DL ML is.
Thanks again for the answer and discussion,
Mohammad