Prof. Ng in this video (at 4:30) said: “A couple of things I would like to point out. By convention, instead of using lambda times the sum of w_j squared, we also divide lambda by 2m so that both the 1st and 2nd terms here are scaled by 1/(2m). It turns out that by scaling both terms the same way it becomes a little bit easier to choose a good value for lambda. And in particular, you find that even if your training set size grows, say you find more training examples, so m, the training set size, is now bigger, the same value of lambda that you’ve picked previously is now also more likely to continue to work if you have this extra scaling by 1/(2m).”
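To make the quote concrete, here is a minimal NumPy sketch of the cost he is describing (function and variable names are mine, assuming a squared-error linear-regression loss): both the data term and the L2 penalty carry the 1/(2m) factor, so the same lambda keeps roughly the same effect as m changes.

```python
import numpy as np

# Illustrative only: regularized linear-regression cost with both terms scaled by 1/(2m).
def regularized_cost(w, b, X, y, lam):
    m = X.shape[0]
    err = X @ w + b - y                          # residuals over the m training examples
    data_term = np.sum(err ** 2) / (2 * m)       # (1/2m) * sum of squared errors
    reg_term = (lam / (2 * m)) * np.sum(w ** 2)  # (lambda/2m) * sum of w_j squared
    return data_term + reg_term
```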
In modern frameworks such as TensorFlow and PyTorch, L2 regularization is implemented as an explicit penalty or as weight decay, and in both cases the penalty is not scaled by the batch size or the training set size m:
- TensorFlow: the L2 regularization penalty is computed as loss = l2 * reduce_sum(square(x)) and is attached to a layer, e.g.
  dense = Dense(3, kernel_regularizer=tf.keras.regularizers.L2(l2=0.01))
  (see the TensorFlow sketch after this list)
- PyTorch: it is built into the optimizer, e.g. torch.optim.SGD(params, lr=0.001, weight_decay=0.01), where the docs describe the parameter as “weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)”
  (see the PyTorch sketch after this list)
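For the TensorFlow case, here is a minimal sketch (toy values, assuming TF 2.x) showing that the Keras L2 regularizer returns exactly l2 * sum(w**2), with no division by the batch size or by m:

```python
import tensorflow as tf

# The Keras L2 regularizer adds l2 * sum(w**2) to the loss; no 1/(2m) factor.
reg = tf.keras.regularizers.L2(l2=0.01)
w = tf.constant([[1.0, -2.0], [3.0, 0.5]])

penalty = reg(w)                             # value the regularizer contributes
manual = 0.01 * tf.reduce_sum(tf.square(w))  # the same formula written out by hand
print(float(penalty), float(manual))         # identical, independent of any m
```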
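And for PyTorch, a minimal sketch (toy values) showing that with plain SGD, weight_decay simply adds weight_decay * w to the gradient, which is equivalent to an L2 penalty of (weight_decay / 2) * ||w||^2, again with no division by the batch size or by m:

```python
import torch

# With a zero data loss, only weight decay moves the weights:
# SGD uses grad + weight_decay * w, so the update is w <- w - lr * weight_decay * w.
w = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
opt = torch.optim.SGD([w], lr=0.1, weight_decay=0.01)

loss = (0.0 * w).sum()  # zero data loss, so the gradient w.r.t. w is zero
loss.backward()
opt.step()

print(w.data)           # tensor([ 0.9990, -1.9980]), independent of any m
```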