L2 regularization: lambda divided by 2m?

Hi everyone,

I am a bit confused about the L2 regularization video, specifically the part on the L2 weighting factor (lambda).

I am not sure if this is standard in DL training with batches.
Normally when training a CNN, I apply L2 regularization as lambda * ||w||^2, where lambda is a constant.

In the video, the regularization term is scaled by 1/(2m), i.e. it is (lambda / (2m)) * ||w||^2, where m is the number of training examples.
After reading several sources, this does make sense in that context. However, those examples use small datasets where the whole dataset is processed in a single iteration (one batch per epoch).
How does this apply to larger datasets where we train on batches? Does m stay the number of training examples, or does it become the batch size? Would a constant lambda (without the division by 2m) still work the same way? I put a small sketch of the two conventions below.
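
For concreteness, here is a minimal NumPy sketch of the two conventions I mean (my own illustration, not from the video; `lam`, `w`, and `m` are placeholder names):

```python
import numpy as np

def l2_penalty_constant(w, lam):
    # Convention I normally use when training CNNs:
    # penalty = lam * ||w||^2, with lam a fixed constant.
    return lam * np.sum(w ** 2)

def l2_penalty_scaled(w, lam, m):
    # Convention from the video:
    # penalty = (lam / (2m)) * ||w||^2, where m is the number
    # of examples the data loss is averaged over.
    return (lam / (2 * m)) * np.sum(w ** 2)

w = np.array([0.5, -1.2, 0.3])
print(l2_penalty_constant(w, lam=0.01))      # 0.0178
print(l2_penalty_scaled(w, lam=0.01, m=64))  # 0.0178 / 128
```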

Thank you!

Some sources on that:

Hi @efer502,

I think m in this case refers to the batch size. So it works if the batch size is 1 (stochastic gradient descent), if the batch contains all the training examples (batch gradient descent), or anything in between (minibatch gradient descent).
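
Here is a minimal minibatch sketch of what that means (my own illustration; the toy linear model and names like `batch_x` are hypothetical, just to keep it short):

```python
import numpy as np

def batch_cost(w, batch_x, batch_y, lam):
    # m is the size of the current minibatch, because the data loss
    # below is averaged over that same minibatch.
    m = batch_x.shape[0]
    preds = batch_x @ w                        # toy linear model
    data_loss = np.mean((preds - batch_y) ** 2)
    l2 = (lam / (2 * m)) * np.sum(w ** 2)      # lambda / (2m) scaling
    return data_loss + l2

rng = np.random.default_rng(0)
batch_x = rng.normal(size=(64, 3))             # one minibatch of 64 examples
batch_y = rng.normal(size=64)
print(batch_cost(np.zeros(3), batch_x, batch_y, lam=0.01))
```

Note that if the batch size is fixed, lambda / (2m) is itself a constant, so a constant coefficient (without the division by 2m) works the same way; it just changes the effective regularization strength, and you would tune lambda accordingly.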
