Hello everyone,
There is a point that I don't understand: why, in the regularization formula, do we divide the term by a factor of m, the number of training examples? Since we sum over the weights associated with our features, the sum runs from 1 to n_x, as shown in the image below, so I think it would be more logical to divide by n_x instead.
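For reference, this is the formula I mean (writing it out here in case the image doesn't render, so please correct me if I have copied it wrong):

J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{j=1}^{n_x} w_j^2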
It is an interesting question. I do not know the answer, but perhaps we will get lucky and someone who knows more will chime in; this has come up a number of times in the past. One high-level point is that not everyone formulates L2 regularization in that way. E.g., here's a lecture from Prof Geoff Hinton which covers L2 regularization, and you'll see that he uses the factor \frac{\lambda}{2} times the sum of the squares of all the weights.
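In other words, the two penalty terms being compared look roughly like this (my paraphrase, so take the exact notation with a grain of salt):

\text{Prof Ng:}\ \frac{\lambda}{2m} \sum_{j=1}^{n_x} w_j^2 \qquad\qquad \text{Prof Hinton:}\ \frac{\lambda}{2} \sum_{j} w_j^2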
So apparently this is a choice that Prof Ng has made, and there are other ways to make it. One idea I can think of that might motivate scaling the factor by \frac{1}{m} is that it perhaps makes the choice of \lambda a bit easier: you can pick one value and it will still work if you change the size of your dataset. With Prof Ng's formulation, the effect of L2 regularization decreases as your training dataset gets larger. And of course we know that one of the primary ways to address overfitting is to increase the size of your training set; in the limit as m \rightarrow \infty, the need for regularization goes to zero. Just a thought, which maybe gives some intuition. As I mentioned above, I say this with the disclaimer that I don't really know the definitive answer.
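Here's a quick numpy sketch of that point. The weight values and \lambda are just made-up numbers to illustrate the scaling, not anything from the course:

```python
import numpy as np

# Made-up weight vector and regularization strength, purely for illustration.
w = np.array([0.5, -1.2, 0.8, 2.0])
lambd = 0.7

for m in [100, 1_000, 10_000, 100_000]:
    # With the 1/m factor, the penalty added to the cost is (lambda / (2*m)) * sum(w^2),
    # so for a fixed lambda its contribution shrinks as the training set grows.
    penalty_with_m = (lambd / (2 * m)) * np.sum(np.square(w))
    # A formulation without the 1/m factor, e.g. (lambda / 2) * sum(w^2), stays constant.
    penalty_without_m = (lambd / 2) * np.sum(np.square(w))
    print(f"m = {m:>7d}: with 1/m -> {penalty_with_m:.6f}, without 1/m -> {penalty_without_m:.6f}")
```

So with the 1/m scaling you get a built-in "less regularization when you have more data" effect for a fixed \lambda, whereas without it you would have to retune \lambda as the dataset size changes.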
I like the intuition you gave about this; I think it makes sense to me now.