# L2 Regularization With Mini-batch GD

In L2 regularization, the new cost is computed as:

`cost = J without regularization + (lambda / (2 * m)) * (sum of squared Frobenius norms of the weight matrices)`

Here, `m` refers to the size of the entire training set. But in mini-batch GD, we train on smaller batches of data at a time. My question: just as the `m` in "J without regularization" changes to the batch size, does the `m` in the regularization term also change to the batch size?

My intuition tells me that it does change to the batch size, but I'd like some expert eyes on this.
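To make the question concrete, here is a minimal sketch of how the regularized mini-batch cost could be computed, with the same `m` (the batch size) used in both terms. The function name `l2_regularized_cost` and the example values are my own for illustration:

```python
import numpy as np

def l2_regularized_cost(unreg_cost, weights, lam, batch_size):
    """Mini-batch cost with L2 regularization.

    `weights` is a list of weight matrices, one per layer.
    Both the data term and the regularization term use the
    same m -- here, the mini-batch size.
    """
    # Sum of squared Frobenius norms over all layers.
    frob = sum(np.sum(W ** 2) for W in weights)
    return unreg_cost + (lam / (2 * batch_size)) * frob

# Hypothetical example: two small weight matrices, batch of 32.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((3, 2))]
cost = l2_regularized_cost(1.25, weights, lam=0.1, batch_size=32)
```

Since the regularization term is non-negative, `cost` is always at least the unregularized cost.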

Additionally, consider SGD with momentum. The weight update is as follows:

`W = W - lr * velocity`, where `velocity = beta * velocity + (1 - beta) * grad(W)`.

With L2 regularization, `grad(W)` can be written as:

`grad(W) = grad(W) without regularization + (lambda / m) * W`

My question is: when updating `velocity`, should we use only "grad(W) without regularization", or the new definition of `grad(W)` above?

That is, should the weights be updated as:

`W = W - lr * velocity - lr * (lambda / m) * W`, where `velocity` is computed using the old, no-regularization definition of `grad(W)`,

or just:

`W = W - lr * velocity`, where `velocity` is now computed using the new definition of `grad(W)`.

Thanks!

Hi, @MalayAgr.

It’s the same `m`, the size of the mini-batch. Here’s an interesting discussion about this scaling factor.

When you add momentum, `W` is still updated as `W = W - lr * velocity`. But `velocity` depends on `grad(W)`, which now carries the additional regularization term `(lambda / m) * W`.
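A minimal sketch of that update, assuming the regularization term is folded into the gradient before the velocity update (the function name `momentum_step` is my own for illustration):

```python
import numpy as np

def momentum_step(W, velocity, grad_unreg, lr, beta, lam, m):
    """One SGD-with-momentum step with L2 regularization.

    The regularization term (lam / m) * W is added to the
    gradient *before* updating the velocity, so the velocity
    tracks the full regularized gradient.
    """
    grad = grad_unreg + (lam / m) * W               # regularized gradient
    velocity = beta * velocity + (1 - beta) * grad  # momentum update
    W = W - lr * velocity                           # weight update
    return W, velocity
```

For example, with `W` all ones, zero unregularized gradient, zero initial velocity, `lr=0.1`, `beta=0.9`, `lam=0.5`, `m=10`, one step gives a velocity of `0.005` and weights of `0.9995` everywhere: the regularization alone shrinks the weights slightly, which is the point of the penalty.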

Hope that helped!
