Actually, I still don't quite get why \frac{\lambda}{m} w_j helps regularize w_j. Can anyone give a more straightforward derivation to show how it impacts the polynomials?
The key is how \lambda is used in the cost equation: it multiplies the sum of the squares of the weights.
That extra term increases the cost, so it creates an incentive for the optimizer to learn smaller weight values.
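For reference, here is the full regularized cost being minimized, written for linear regression (logistic regression just swaps in a different loss term):

J(\vec{w},b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2

Large weights inflate the second term, so minimizing J forces a trade-off between fitting the data and keeping the weights small.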
The equation you quoted is the gradient (the partial derivative of the cost with respect to each weight). Gradient descent uses those gradients to find the weights that minimize the cost.
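To make that concrete, substitute the regularized gradient into the gradient descent update (assuming the usual update rule with learning rate \alpha):

w_j := w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} w_j \right]

and regroup the w_j terms:

w_j := w_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}

Since \alpha \frac{\lambda}{m} is a small positive number, every iteration first multiplies w_j by a factor slightly less than 1 and only then applies the usual data-driven step. For a polynomial model, that means the coefficients on the higher-order terms keep getting shrunk toward zero unless the data actively supports them, which is exactly the regularizing effect you asked about.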
If the question is why they scale the L2 regularization term by \frac{\lambda}{m} rather than using plain \lambda, I think you could have done it either way; it's just a constant, after all. I have never seen an "official" explanation of this, but my theory is that you want the choice of the \lambda hyperparameter to be orthogonal to the size of the training set.

Note that when you train, you may have several differently sized training sets, e.g. a smaller subset that you use early on to speed up experimentation while you're playing with hyperparameters, and then the full training set once you feel your hyperparameter choices are close. It would be awkward if you had to retune \lambda for each dataset size.

The other intuition here is that the purpose of L2 regularization is to eliminate or mitigate overfitting, and the other strategy for reducing overfitting is to get more training data. With the factor of \frac{1}{m}, in the limit as m \rightarrow \infty the L2 term goes to zero. So if you have the ability to add more data (which is not always practical), you wouldn't also have to keep fiddling with the \lambda value, at least in theory.
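Here's a quick numerical sketch of that scaling intuition (just illustrative NumPy with made-up numbers, not code from the course): with the \frac{\lambda}{2m} convention, the data term stays about the same size as m grows, while the penalty term shrinks like \frac{1}{m}.

```python
import numpy as np

rng = np.random.default_rng(0)

def regularized_cost_and_grad(X, y, w, b, lam):
    """Regularized linear-regression cost pieces and weight gradient.

    Uses the lambda/(2m) convention discussed above. Variable names
    here are my own, purely for illustration.
    """
    m = X.shape[0]
    err = X @ w + b - y                      # prediction errors, shape (m,)
    data_cost = (err @ err) / (2 * m)        # mean squared error term
    penalty = lam / (2 * m) * (w @ w)        # L2 penalty, scaled by 1/m
    grad_w = X.T @ err / m + (lam / m) * w   # data gradient + decay term
    return data_cost, penalty, grad_w

# Two dataset sizes drawn from the same underlying distribution.
w_true = np.array([3.0, -2.0])
w, b, lam = np.array([1.0, 1.0]), 0.0, 10.0

for m in (100, 100_000):
    X = rng.normal(size=(m, 2))
    y = X @ w_true + rng.normal(scale=0.5, size=m)
    data_cost, penalty, _ = regularized_cost_and_grad(X, y, w, b, lam)
    print(f"m={m:>6}: data term ~ {data_cost:.3f}, penalty term = {penalty:.5f}")

# The data term stays roughly the same size, while the penalty term
# shrinks like 1/m -- so with a fixed lambda, regularization fades as
# you add more data, matching the intuition above.
```

With a fixed \lambda = 10 here, the penalty drops from about 0.1 at m = 100 to about 10^{-4} at m = 100{,}000, while the data term barely moves.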
But don’t forget my disclaimer from earlier: this is just my theory and I don’t have any external evidence to support it.