Why is the L2 regularization term multiplied by (1/m)? Isn’t regularization about penalizing large weights? Why would the number of samples make a difference when performing L2 regularization? Is it there to make sure we regularize “less” when we have more data, since more data generally means a lower chance of overfitting and so less need for regularization?

Two reasons.

First, the 1/m turns the total cost into an average cost per example. That keeps cost values comparable across training sets of different sizes.

Second, a model trained on a larger data set is less prone to overfitting (lower variance, in the bias-variance sense), so it needs less regularization.
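
To make both points concrete, here is the cost written out. This is a sketch that assumes the common course-style form, with a generic per-example loss $\mathcal{L}$ and a factor of 1/2 on the penalty; both are assumptions, since the thread does not spell out the formula.

$$
J(w) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) + \frac{\lambda}{2m}\sum_{j} w_j^2
$$

Multiplying through by $m$ does not change which $w$ minimizes the cost:

$$
m\,J(w) = \sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) + \frac{\lambda}{2}\sum_{j} w_j^2
$$

In this form the data term is a sum that grows with $m$, while the penalty stays fixed for a given $\lambda$, so the penalty has less relative influence as the training set grows.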

Prof Ng doesn’t really explain this, but Tom’s points seem like the best explanation to me as well. The best way to deal with overfitting is to get more training data (although that may not always be practical), so you can think of the need for regularization as going to zero as $m \rightarrow \infty$.

Another way to look at Tom’s first point is that you’re making the choice of the hyperparameter $\lambda$ orthogonal to the size of the training set. Prof Ng does make the general point that it’s preferable for hyperparameters to be orthogonal when possible, because it simplifies the search process for tuning them.
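
If it helps to see the numbers, here is a minimal NumPy sketch of how the L2 term shrinks as m grows while $\lambda$ and the weights stay fixed. It assumes a squared-error cost and a penalty of the form (lambda / (2m)) * sum(w^2); the function names, the weights, and the value of lambda are all made up for illustration and are not taken from the course code.

```python
import numpy as np


def l2_penalty(w, lam, m):
    """L2 term, assumed form: (lam / (2 * m)) * sum(w ** 2)."""
    return lam / (2.0 * m) * float(np.sum(np.square(w)))


def regularized_cost(X, y, w, lam):
    """Average squared-error cost plus the L2 term (bias omitted for brevity)."""
    m = X.shape[0]
    residual = X @ w - y
    data_cost = float(residual @ residual) / (2.0 * m)
    return data_cost + l2_penalty(w, lam, m)


# Illustrative values only (hypothetical weights and lambda).
w = np.array([1.0, -2.0, 0.5])
lam = 0.7

# Same weights, same lambda: the penalty shrinks in proportion to 1/m.
for m in (10, 100, 1000, 10000):
    print(m, l2_penalty(w, lam, m))
```

With the weights and $\lambda$ held fixed, each tenfold increase in m cuts the penalty by a factor of ten, which is the "regularize less with more data" behaviour discussed above.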

That makes a lot of sense, thanks a lot!