In L2 regularization, the regularized cost is computed as:
cost = J without regularization + (lambda / (2 * m)) * (sum of squared Frobenius norms of the weight matrices)
Here, the ‘m’ refers to the size of the entire training set. But in mini-batch GD, we train on smaller batches of data at a time. My question is: just as the ‘m’ in “J without regularization” changes to the batch size, does the ‘m’ in the regularization term also change to the batch size?
My intuition tells me that it does change to the batch size, but I’d like some expert eyes on this.
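To make sure I'm describing the same thing, here is a small NumPy sketch of the per-mini-batch cost under my assumption (the function and variable names are placeholders I made up, not from any particular framework):

```python
import numpy as np

def l2_regularized_cost(unregularized_cost, weights, lam, m):
    """Add the L2 penalty to an already-computed unregularized cost.

    Here `m` is assumed to be the number of examples the unregularized
    cost was averaged over, i.e. the current mini-batch size.
    """
    l2_term = (lam / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return unregularized_cost + l2_term

# Hypothetical usage on one mini-batch of 64 examples:
# batch_cost = compute_cost(AL, Y_batch)  # averaged over the 64 examples
# cost = l2_regularized_cost(batch_cost, [W1, W2, W3], lam=0.1, m=64)
```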
Additionally, consider SGD with momentum. The weight update is:
W = W - lr * velocity
where velocity = beta * velocity + (1 - beta) * grad(W).
With L2 regularization, grad(W) can be written as:
grad(W) = grad(W) without regularization + (lambda / m) * W
My question is: while updating velocity, should we use only “grad(W) without regularization”, or should we use the new definition of grad(W) above?
That is, should the weights be updated as:
W = W - lr * velocity - lr * (lambda / m) * W
where velocity is computed using the old, no-regularization definition of grad(W), or simply as:
W = W - lr * velocity
where velocity is now computed with the new definition of grad(W)?
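To make the two options concrete, here is a NumPy sketch of both update rules as I understand them (the function names are my own, and the (lambda / m) scaling just follows the cost definition at the top):

```python
import numpy as np

def momentum_step_decoupled(W, vW, dW_no_reg, lr, beta, lam, m):
    """First option above: velocity sees only the unregularized gradient,
    and the weight-decay term is applied to W directly in the update."""
    vW = beta * vW + (1 - beta) * dW_no_reg
    W = W - lr * vW - lr * (lam / m) * W
    return W, vW

def momentum_step_coupled(W, vW, dW_no_reg, lr, beta, lam, m):
    """Second option above: the L2 term is folded into grad(W) before the
    velocity update, so momentum smooths the regularization gradient too."""
    dW = dW_no_reg + (lam / m) * W  # new definition of grad(W)
    vW = beta * vW + (1 - beta) * dW
    W = W - lr * vW
    return W, vW

# Toy usage with made-up shapes and hyperparameters:
# W, vW = np.random.randn(3, 4), np.zeros((3, 4))
# dW_no_reg = np.random.randn(3, 4)
# W, vW = momentum_step_coupled(W, vW, dW_no_reg, lr=0.01, beta=0.9, lam=0.1, m=64)
```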
Thanks!