About the \frac{\lambda}{m} w_j term

If the question is why the L2 regularization term is scaled by \frac{\lambda}{m} rather than just plain \lambda, I think you could have done it either way; it's just a constant, after all. I've never seen an "official" explanation of this, but my theory is that the scaling makes the choice of the \lambda hyperparameter orthogonal to the size of the training set. Note that when you train, you may work with several differently sized training sets, e.g. a smaller subset early on to speed up experimentation while you're playing with hyperparameters, and then the full training set once you feel like your hyperparameter choices are getting close. It would be awkward if you had to tune \lambda separately for the two differently sized datasets.

The other intuition here is that the purpose of L2 regularization is to eliminate or mitigate overfitting, and the other strategy for reducing overfitting is to get more training data. With the factor of \frac{1}{m}, the L2 term goes to zero in the limit as m \rightarrow \infty. So if you have the ability to add more data (which is not always practical), you wouldn't also have to do further fiddling with the \lambda value, in theory at least.
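To make the size-invariance point concrete, here's a small sketch (the weight vector and \lambda value are made up purely for illustration) comparing the unscaled penalty \lambda \sum_j w_j^2 with the scaled penalty \frac{\lambda}{m} \sum_j w_j^2 as the training-set size m grows:

```python
import numpy as np

lam = 0.1                                # illustrative lambda value
w = np.array([0.5, -1.2, 0.8])           # illustrative fixed weight vector
l2 = np.sum(w ** 2)                      # sum of squared weights

for m in [100, 1_000, 10_000, 100_000]:
    unscaled = lam * l2                  # independent of m: same pull on the weights at any data size
    scaled = lam / m * l2                # shrinks as m grows, vanishing as m -> infinity
    print(f"m={m:>7}: unscaled={unscaled:.6f}  scaled={scaled:.10f}")
```

The unscaled penalty exerts the same pull on the weights no matter how much data you have, while the scaled one fades as the dataset grows, which matches the intuition that more data reduces the need for regularization.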

But don’t forget my disclaimer from earlier: this is just my theory, and I don’t have any external evidence to support it. :nerd_face: