I am a bit confused about the L2 weighting part (lambda) in the L2 regularization video.
I am not sure whether what is shown there is standard for DL training with batches.
Normally, while training CNNs, I apply L2 regularization as lambda * ||w||^2, where lambda is a constant.
In the video, the penalty is instead divided by 2m, where m is the number of training examples.
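For reference, the cost shown there is, as far as I can tell, the standard L2-regularized cost:

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) + \frac{\lambda}{2m} \lVert w \rVert_2^2$$

so both the data term and the penalty are averaged over the same m.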
After reading several sources, it does make sense in that context. However, those examples use small datasets where the whole dataset fits into a single iteration (i.e., one epoch = one gradient step).
How does this apply to larger datasets where we train in batches? Does m stay the total number of examples, or does it become the batch size? Would a constant lambda (without dividing by 2m) still work the same way?
I think m in this case refers to the batch size. So it works whether the batch size is 1 (SGD), the full training set (batch gradient descent), or anything in between (minibatch gradient descent).
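As a minimal sketch of what the m = batch size reading looks like in a training loop (PyTorch here; the toy data and names like `lam` are my own assumptions, not from the video):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy setup, purely illustrative.
torch.manual_seed(0)
X, y = torch.randn(256, 10), torch.randn(256, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32)

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
lam = 0.01  # lambda, the regularization strength

for x_batch, y_batch in loader:
    m = x_batch.shape[0]  # m = batch size here, not the full dataset size
    data_loss = loss_fn(model(x_batch), y_batch)  # already the mean over the batch
    # L2 penalty scaled the same way as the data term:
    #   J = (1/m) * sum_i L_i  +  (lambda / (2m)) * ||w||^2
    # (For simplicity this sums over all parameters, biases included;
    # usually only the weights are regularized.)
    l2 = sum((p ** 2).sum() for p in model.parameters())
    loss = data_loss + lam / (2 * m) * l2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Since `MSELoss` already averages the data term over the m examples in the batch, dividing the penalty by 2m keeps both terms on the same per-example scale, whatever the batch size is.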