L2 Regularization With Mini-batch GD

In L2 regularization, the new cost is computed as:

cost = J without regularization + (lambda / (2m)) * (sum of squared Frobenius norms of the weight matrices)

Here, ‘m’ refers to the size of the entire training set. But in mini-batch GD, we train on smaller batches of data at a time. My question: just as the ‘m’ in “J without regularization” changes to the batch size, does the ‘m’ in the regularization term also change to the batch size?

My intuition tells me that it does change to the batch size but I’d like some expert eyes on this.
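To make the question concrete, here's a minimal NumPy sketch of how I compute the regularized cost on a batch (the function and argument names are just placeholders for this post); the question is what m should be here:

```python
import numpy as np

def l2_regularized_cost(unregularized_cost, weights, lambd, m):
    # weights: list of weight matrices, e.g. [W1, W2, W3]
    # m: should this be the mini-batch size or the full training set size?
    l2_penalty = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return unregularized_cost + l2_penalty
```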

Additionally, consider SGD with momentum. The weight update is as follows:

W = W - lr * velocity, where velocity = beta * velocity + (1 - beta) * grad(W).

With L2 regularization, grad(W) can be written as:

grad(W) = grad(W) without regularization + (lambda / m) * W

My question is: while updating the velocity, should we use only “grad(W) without regularization”, or should we use the new definition of grad(W) above?

That is, should the weights be updated as:

W = W - lr * velocity - lr * (lambda / m) * W, where velocity is computed using the old, no-regularization definition of grad(W)

or just:

W = W - lr * velocity, where velocity is now computed using the new definition of grad(W).
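To make the two options concrete, here's a rough sketch of both variants (the function and argument names are just placeholders I made up for this post; W, velocity and dW_no_reg are NumPy arrays of the same shape):

```python
def update_option_1(W, velocity, dW_no_reg, lr, beta, lambd, m):
    # Velocity only sees the unregularized gradient;
    # the L2 term is applied directly in the weight update.
    velocity = beta * velocity + (1 - beta) * dW_no_reg
    W = W - lr * velocity - lr * (lambd / m) * W
    return W, velocity

def update_option_2(W, velocity, dW_no_reg, lr, beta, lambd, m):
    # The L2 term is folded into the gradient first,
    # so it also flows through the velocity.
    dW = dW_no_reg + (lambd / m) * W
    velocity = beta * velocity + (1 - beta) * dW
    W = W - lr * velocity
    return W, velocity
```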

Thanks!

Hi, @MalayAgr.

Yes, it’s the same m in both terms: the size of the mini-batch. Here’s an interesting discussion about this scaling factor.

When you add momentum, W is still updated as W = W - lr * V_dW. But V_dW depends on dW, which now has the additional regularization term: dW = dW without regularization + (lambda / m) * W. So the velocity should be computed using the new, regularized definition of grad(W).
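A minimal sketch of one update step, assuming plain NumPy arrays and a single weight matrix (the names here are illustrative, not taken from the course code):

```python
def momentum_step_with_l2(W, V_dW, dW_no_reg, lr, beta, lambd, batch_size):
    # m is the mini-batch size, in both the cost and the gradient.
    dW = dW_no_reg + (lambd / batch_size) * W   # regularized gradient
    V_dW = beta * V_dW + (1 - beta) * dW        # velocity sees the L2 term too
    W = W - lr * V_dW
    return W, V_dW
```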

Hope that helped! 🙂
