Why does the regularization term in L2 Regularization include division by the number of examples (m)?

I have a conceptual question regarding the commonly used L2 regularization term in machine learning models, specifically about the inclusion of the factor $1/m$, where $m$ is the number of training examples. Typically, the regularized cost function is defined as follows:

$$J \;=\; \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}^{(i)} \;+\; \frac{\lambda}{2m}\sum_{l=1}^{L}\big\|w^{[l]}\big\|^2$$

My confusion arises because the weight parameters $w$ do not inherently depend on the number of training examples $m$. Intuitively, then, the regularization term $\sum_{l=1}^{L}\|w^{[l]}\|^2$ should not change its relative magnitude when we increase or decrease the number of examples.

However, including the $1/m$ factor explicitly makes the regularization term inversely proportional to the number of examples: with more examples it becomes smaller (given fixed weights), and with fewer examples it becomes larger. Note that the summed loss $\sum_{i=1}^{m}\mathcal{L}^{(i)}$ does grow with the number of examples, so including the $1/m$ factor there is clearly justified: the resulting average $\frac{1}{m}\sum_{i=1}^{m}\mathcal{L}^{(i)}$ stays roughly constant for the same epoch and the same neural network architecture if we change only $m$. But this is not the case for the regularization term.

If we omit the (1/m) factor in the regularization:

$$J \;=\; \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}^{(i)} \;+\; \frac{\lambda}{2}\sum_{l=1}^{L}\big\|w^{[l]}\big\|^2$$

then the regularization term would remain constant irrespective of the dataset size, which aligns with my intuition that the regularization term should depend only on the weights themselves, not on the dataset size.
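To convince myself of this scaling behavior, I wrote the minimal NumPy sketch below (a toy logistic loss on random data; the setup and names are mine and purely illustrative). The data term stays roughly constant as $m$ grows, because it is an average, while the $\lambda/2m$ penalty shrinks and the $\lambda/2$ penalty stays fixed:

```python
import numpy as np

rng = np.random.default_rng(0)

def regularized_cost(X, y, w, lam, penalty_over_m=True):
    """Toy logistic cost with an L2 penalty on w (a single 'layer' for simplicity)."""
    m = X.shape[0]
    p = 1.0 / (1.0 + np.exp(-(X @ w)))               # sigmoid predictions
    eps = 1e-12                                      # guard against log(0)
    data_term = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    penalty = (lam / (2 * m) if penalty_over_m else lam / 2) * np.sum(w ** 2)
    return data_term, penalty

w = rng.normal(size=5)      # fixed weights, reused at every dataset size
lam = 0.7
for m in (100, 10_000):
    X = rng.normal(size=(m, 5))
    y = rng.integers(0, 2, size=m).astype(float)
    data_term, pen_m = regularized_cost(X, y, w, lam, penalty_over_m=True)
    _, pen_fixed = regularized_cost(X, y, w, lam, penalty_over_m=False)
    print(f"m={m:6d}  data term ~ {data_term:.3f}  "
          f"(lam/2m) penalty = {pen_m:.6f}  (lam/2) penalty = {pen_fixed:.6f}")
```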

Could someone please clarify the following points:

  • Why exactly is the $1/m$ factor conventionally included in the regularization component?
  • How does the presence or absence of this factor affect the gradient descent updates in practice? (See the update-rule sketch just after this list.)
  • Is there any rigorous mathematical or statistical reasoning that justifies using $1/m$?
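
To make the second bullet concrete, here is how I currently understand the two update rules would differ (a toy sketch of plain batch gradient descent on a single weight matrix; `grad_data` stands in for the gradient of the averaged loss term):

```python
import numpy as np

def step_with_1_over_m(W, grad_data, alpha, lam, m):
    # dJ/dW = grad_data + (lam / m) * W  ->  per-step weight decay shrinks as m grows
    return W - alpha * (grad_data + (lam / m) * W)

def step_without_1_over_m(W, grad_data, alpha, lam):
    # dJ/dW = grad_data + lam * W  ->  per-step weight decay is independent of m
    return W - alpha * (grad_data + lam * W)

W = np.ones((3, 3))
grad_data = np.zeros((3, 3))    # zero data gradient, to isolate the decay effect
alpha, lam = 0.1, 1.0
for m in (100, 10_000):
    W_next = step_with_1_over_m(W, grad_data, alpha, lam, m)
    print(f"m={m:6d}: with 1/m, one step multiplies W by {W_next[0, 0]:.6f}")
W_next = step_without_1_over_m(W, grad_data, alpha, lam)
print(f"without 1/m, one step multiplies W by {W_next[0, 0]:.6f} at any m")
```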

Any detailed insights or references would be greatly appreciated.


Division by $m$ reduces the regularized portion of the cost when the dataset is large.

This works well empirically: a large dataset already exposes the model to much more of the natural variance in the data, so overfitting is much less of a problem and less regularization is needed.


Yes, and you may get more training data later: including the factor of $1/m$ in the regularization term makes the hyperparameter $\lambda$ “orthogonal” to the size of the training set. In the discussions of making hyperparameter choices, Prof Ng mentions that it is advantageous to make the various hyperparameters independent when this is possible, because it simplifies the process of tuning them.
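
One concrete way to see this orthogonality is with ridge regression, where the L2-regularized solution is available in closed form (a sketch of my own; ridge is just L2-regularized linear regression, so the algebra is exact):

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 10.0
w_true = np.array([2.0, -3.0, 0.5])

for m in (30, 30_000):
    X = rng.normal(size=(m, 3))
    y = X @ w_true + rng.normal(scale=1.0, size=m)
    # With 1/m: minimize (1/2m)||Xw - y||^2 + (lam/2m)||w||^2
    #   -> normal equations (X^T X + lam I) w = X^T y
    w_with = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    # Without 1/m: minimize (1/2m)||Xw - y||^2 + (lam/2)||w||^2
    #   -> normal equations (X^T X + m * lam I) w = X^T y
    w_without = np.linalg.solve(X.T @ X + m * lam * np.eye(3), X.T @ y)
    print(f"m={m:6d}  with 1/m: {np.round(w_with, 3)}  "
          f"without: {np.round(w_without, 3)}")
```

With the $1/m$ convention the penalty contributes a fixed $\lambda I$ while $X^T X$ grows with $m$, so the same $\lambda$ is gradually washed out by more data and the estimate approaches the unregularized one; without it the penalty grows as $m\lambda I$ and the shrinkage never fades, so $\lambda$ would need retuning whenever the dataset grows. As far as I understand, this also matches the statistical view of L2 regularization as a fixed Gaussian prior on the weights (MAP estimation), whose influence should fade as evidence accumulates.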