In L2 regularization, the regularized cost is computed as:
cost = J without regularization + (lambda / (2 * m)) * (sum of squared Frobenius norms of the weight matrices)
Here, the ‘m’ refers to the size of the entire training set. But in mini-batch GD, we train on smaller batches of data at a time. My question is: just as the ‘m’ in “J without regularization” changes to the batch size, does the ‘m’ in the regularization term also change to the batch size?
My intuition tells me that it does change to the batch size, but I’d like some expert eyes on this.
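To make sure I'm describing the same thing, here is a small NumPy sketch of the per-mini-batch cost under my assumption (the function and variable names are placeholders I made up, not from any particular framework):

```python
import numpy as np

def l2_regularized_cost(unregularized_cost, weights, lam, m):
    """Add the L2 penalty to an already-computed unregularized cost.

    Here `m` is assumed to be the number of examples the unregularized
    cost was averaged over, i.e. the current mini-batch size.
    """
    l2_term = (lam / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return unregularized_cost + l2_term

# Hypothetical usage on one mini-batch of 64 examples:
# batch_cost = compute_cost(AL, Y_batch)  # averaged over the 64 examples
# cost = l2_regularized_cost(batch_cost, [W1, W2, W3], lam=0.1, m=64)
```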
Additionally, consider SGD with momentum. The weight update is:
W = W - lr * velocity
where velocity = beta * velocity + (1 - beta) * grad(W).
With L2 regularization, grad(W) can be written as:
grad(W) = grad(W) without regularization + (lambda / m) * W
My question is: while updating velocity, should we use only “grad(W) without regularization”, or should we use the new definition of grad(W) above?
That is, should the weights be updated as:
W = W - lr * velocity - lr * (lambda / m) * W
where velocity is computed using the old, no-regularization definition of grad(W), or simply as:
W = W - lr * velocity
where velocity is now computed with the new definition of grad(W)?
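To make the two options concrete, here is a NumPy sketch of both update rules as I understand them (the function names are my own, and the (lambda / m) scaling just follows the cost definition at the top):

```python
import numpy as np

def momentum_step_decoupled(W, vW, dW_no_reg, lr, beta, lam, m):
    """First option above: velocity sees only the unregularized gradient,
    and the weight-decay term is applied to W directly in the update."""
    vW = beta * vW + (1 - beta) * dW_no_reg
    W = W - lr * vW - lr * (lam / m) * W
    return W, vW

def momentum_step_coupled(W, vW, dW_no_reg, lr, beta, lam, m):
    """Second option above: the L2 term is folded into grad(W) before the
    velocity update, so momentum smooths the regularization gradient too."""
    dW = dW_no_reg + (lam / m) * W  # new definition of grad(W)
    vW = beta * vW + (1 - beta) * dW
    W = W - lr * vW
    return W, vW

# Toy usage with made-up shapes and hyperparameters:
# W, vW = np.random.randn(3, 4), np.zeros((3, 4))
# dW_no_reg = np.random.randn(3, 4)
# W, vW = momentum_step_coupled(W, vW, dW_no_reg, lr=0.01, beta=0.9, lam=0.1, m=64)
```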
Thanks!