When we calculate the cost, we take the average by multiplying the sum of the losses over the examples by \frac{1}{m}. This makes sense, since it keeps the magnitude of the cost comparable to the loss on a single example. What I saw Andrew do here is also normalize the regularizer, dividing the regularization coefficient \lambda by m (or 2m, to be precise). For the regularizer, however, this does not make sense to me, because the magnitude of the parameters does not depend on the number of examples. Can somebody enlighten me on why this is done?

It puts the effect of the regularizer on a different scale from the cost and takes away any "absolute" meaning of the value of \lambda. I know the effect can be compensated for by increasing \lambda m-fold, but then again, I don't see why we would want to do that and give up the interpretability of \lambda. (I also know that as you get more data points you may want less regularization, so it can make some sense that way, but why not leave that to hyperparameter tuning and retain the meaning of the "absolute" value of \lambda?)
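To make the two conventions concrete, here is a minimal NumPy sketch of what I mean; the function and variable names are my own, not from the course:

```python
import numpy as np

def regularized_cost(losses, w, lam, normalize_reg=True):
    """L2-regularized cost over m examples (names are mine, not from the course).

    losses: per-example losses, shape (m,)
    w:      weight vector (bias excluded, as usual)
    lam:    regularization coefficient lambda
    normalize_reg: True  -> course convention, penalty = lam/(2m) * ||w||^2
                   False -> penalty = lam/2 * ||w||^2, independent of m
    """
    m = losses.shape[0]
    data_term = losses.sum() / m                    # average loss: scale of a single example
    if normalize_reg:
        penalty = lam / (2 * m) * np.sum(w ** 2)    # shrinks as m grows, for a fixed lam
    else:
        penalty = lam / 2 * np.sum(w ** 2)          # same "absolute" meaning for any m
    return data_term + penalty

# Doubling m halves the penalty's relative weight under the first convention only.
rng = np.random.default_rng(0)
losses, w = rng.random(1000), rng.normal(size=20)
print(regularized_cost(losses, w, lam=0.1, normalize_reg=True))
print(regularized_cost(losses, w, lam=0.1, normalize_reg=False))
```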
Thank you Tom. I completely sympathize with that reason. But then my question becomes: why not be conscious of that and deliberately set lower values of \lambda when doing hyperparameter tuning? We have an intuition from statistics, for normally distributed random variables, that the standard deviation of a sample mean shrinks like \frac{1}{\sqrt{m}}. I'm not saying this translates directly to the math here, but, for example, how about not dividing \lambda by m and instead centering the range of values you experiment with around some value \frac{c}{\sqrt{m}}?
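Just to illustrate what I have in mind (a sketch with arbitrary constants, not a recommendation):

```python
import numpy as np

def lambda_candidates(m, c=1.0, decades=2, num=7):
    """Log-spaced candidates for an *unnormalized* lambda, centered at c / sqrt(m).
    c, decades, and num are arbitrary illustrative choices, not a tuning recipe.
    """
    center = c / np.sqrt(m)
    return np.logspace(np.log10(center) - decades, np.log10(center) + decades, num=num)

print(lambda_candidates(m=1000))   # centered near 0.0316 for m = 1000
```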
There are a lot of ways it could work. Many would work equally well.
The machine learning industry evolved over a relatively short period of time, with a lot of people working independently. There really isn't a lot of consensus on specific practices. Statistics has been around for centuries and has had time to mature.
If you have a strong statistics background, it's best to set that aside for a while, and you'll see that ML accomplishes many of the same things but using methods that evolved separately.
Thank you again for your answer, Tom. I actually come from a non-DL ML (statistical ML) background (I was living under a rock for a very long time, and while I knew the ideas of DL, I never bothered to learn more about it, which is why I'm taking this course), and I usually saw the regularization coefficient not being "normalized". As for setting the stats view aside, I'm aligned with you. I actually teach ML, and some of my biggest problems are with people who do have a stats background: they want to hit everything with their stats hammer, see every problem as a nail for that hammer, and fail to appreciate the difference in perspective between us. I would say we use the same methods they do but view the problem in a completely different light: we are mainly concerned with performance, they are mainly concerned with explanation. With DL, and especially very big models, you can say on paper that this is a statistical method that kind of averages the data, but really the model is so huge that it can fit everything, and the averaging part of what the network does is not that prominent. So DL is even farther from a statistician's background than non-DL ML is.
Thanks again for the answer and discussion,
Mohammad