In the video Why Regularization Reduces Overfitting, it mentions that a bigger \lambda leads to smaller W. Why is that so?
I suppose what Andrew is talking about is updating the weights by back-propagation with the partial derivatives, dw = \frac{\partial J}{\partial w^{l}}.
If we add the L2 regularization term with \lambda, dw also contains the term \frac{\lambda}{m}w^{l}. Updating the weights by back-propagation means “subtracting” \alpha \, dw from the previous weights, i.e. w^{l} := w^{l} - \alpha\left(\frac{\partial J}{\partial w^{l}} + \frac{\lambda}{m}w^{l}\right) = \left(1 - \frac{\alpha\lambda}{m}\right)w^{l} - \alpha\frac{\partial J}{\partial w^{l}}. So, if we set a bigger \lambda, each update shrinks w^{l} by a larger factor, and the weights are driven toward smaller values.
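Here is a minimal numerical sketch of that effect (not the course's assignment code): gradient descent on a single linear layer, once without regularization and once with a large \lambda. The data, learning rate, and iteration count are made-up values just for illustration.

```python
import numpy as np

np.random.seed(0)

m = 200                                        # number of examples
X = np.random.randn(3, m)                      # features, shape (n_x, m)
Y = np.array([[1.0, -2.0, 3.0]]) @ X + 0.1 * np.random.randn(1, m)  # noisy targets

def train(lambd, alpha=0.1, iterations=500):
    W = np.random.RandomState(1).randn(1, 3)   # same init for both runs
    b = 0.0
    for _ in range(iterations):
        A = W @ X + b                          # forward pass (linear model)
        dZ = (A - Y) / m                       # gradient of the unregularized cost
        dW = dZ @ X.T + (lambd / m) * W        # L2 adds the (lambda/m) * W term
        db = np.sum(dZ)
        W -= alpha * dW                        # "subtract" dW, scaled by alpha
        b -= alpha * db
    return W

print("||W|| with lambda = 0   :", np.linalg.norm(train(0.0)))
print("||W|| with lambda = 100 :", np.linalg.norm(train(100.0)))
```

With \lambda = 100 the weight norm comes out noticeably smaller, because every update multiplies W by the factor (1 - \alpha\lambda/m) before applying the data gradient.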