Why don't we include regularization term in the training dataset?

Refer Week 3 lecture on evaluating the model (check the snapshot below).

I am confused about why we don’t include the regularization term for the training dataset.


You mean in the testing phase there is no regularization!

Regularization helps fit the model to the data by suppressing the weights. During testing you are not fitting anything; you just want to check whether what you came up with during training is good enough at predicting unseen data. Regularization is only used to train the model!


No. I mean why aren’t we using it on the training dataset. I get why we wouldn’t use it on the testing dataset, but the lecture says not to include it for the training dataset either (check the last line of the lecture snapshot I attached: there is no regularization term in the cost function equation for the training dataset).


Hello @syedaskarimuslim,

Because at evaluation we only care about how good the predictions are, regardless of whether we are evaluating with the training set or the test set.



Hello @rmwkwok
Regularization is an antidote to overfitting. I can understand why we wouldn’t be concerned about regularization during testing; however, I can’t understand why we wouldn’t use this antidote in training to make sure our predictions are not coming out of an over-fitted curve.


Regularization is used during training to control overfitting. For this you use the regularized cost.

Once you have fit the model, now you just want to measure how well it works.

For this you use the unregularized cost. This is because now you do not want to include the additional penalties based on the weight values.


Hello @syedaskarimuslim,

As Tom explained, we need to be aware that there are two stages: a training stage and an evaluation stage. Your question focused on the former, while the slide focuses on the latter. At least that is what I got from watching the lecture, but if the lecture said anything that made you think otherwise, please share the exact time mark in the video and I will watch it again.



Refer to Course 2, Week 3, video titled “Evaluating a model”, timestamp 5:35.


Thanks, @syedaskarimuslim. I have watched from 5:35 to 6:35 again.

In the training stage, exactly when we are applying gradient descent, we use the cost function that includes the regularization term on the training set.

In the evaluation stage when we are not thinking about gradient descent at all, we use the cost function without the regularization on both the training and the test set.
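To make the two stages concrete, here is a minimal numpy sketch (the toy data, weights, and lambda value are all made up for illustration): the same cost function is called with the regularization term during training and without it during evaluation.

```python
import numpy as np

def cost(w, b, X, y, lam=0.0):
    """Mean squared error cost; lam > 0 adds the L2 regularization term."""
    m = X.shape[0]
    err = X @ w + b - y
    mse = np.sum(err ** 2) / (2 * m)
    reg = lam / (2 * m) * np.sum(w ** 2)  # penalty on the weights only, not on b
    return mse + reg

# toy data and parameters, purely for illustration
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
w, b = np.array([1.5]), 0.5

J_train = cost(w, b, X, y, lam=1.0)  # what gradient descent minimizes
J_eval  = cost(w, b, X, y, lam=0.0)  # what you report on train/test sets
```

The regularization term only changes which parameters gradient descent converges to; once the parameters are fixed, the unregularized cost is the honest measure of prediction error on either set.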

Do you disagree with, or are you unclear about, anything I said above?



Hi there,

in addition to the excellent replies of my fellow mentors:

The training data set is just pure data. When training the model (=fitting parameters), we optimize:

  • the model fit to the training data so that a good performance is reached on training data
  • and steer the complexity of the model with regularization

Afterwards we are done with training and there is nothing more to regularise at this point.

Then we just test how good the training was with respect to reality and new data which the model did not see before. Therefore, we provide the model with an unseen test set. Now we can evaluate how well the model performs on this new test set.

So simplified we can say:

  • if the model performs clearly worse on the test set compared to its performance on the training data, this indicates overfitting, which means that the model complexity (which we were steering in training with regularization) was potentially too high given the available data

  • if the performance of the model on the new test set is comparable with its performance on the training data, and this suits your business requirements, this is a good sign that regularization was effective and that you prevented overfitting by keeping the model complexity in a state where the model can generalise well (and does not fit the noise too much)
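As a sketch of that diagnosis (the toy data, polynomial degree, and alpha values are my own choices for illustration, not from the lecture), fit a deliberately flexible model with and without L2 regularization and compare the training and test errors:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 1-D problem (made up for illustration): y = 3x + noise
x_train = rng.uniform(-1, 1, 20)
y_train = 3 * x_train + rng.normal(0, 0.3, size=20)
x_test = rng.uniform(-1, 1, 20)
y_test = 3 * x_test + rng.normal(0, 0.3, size=20)

def poly_features(x, degree):
    # columns 1, x, x^2, ..., x^degree
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(X, y, alpha):
    # closed-form L2-regularized least squares: (X^T X + alpha*I) w = X^T y
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, w):
    return np.mean((X @ w - y) ** 2)

Xtr = poly_features(x_train, 6)  # degree-6 model: more complex than the data needs
Xte = poly_features(x_test, 6)

results = {}
for name, alpha in [("unregularized", 0.0), ("regularized", 1.0)]:
    w = ridge_fit(Xtr, y_train, alpha)
    results[name] = (mse(Xtr, y_train, w), mse(Xte, y_test, w))
    print(name, results[name])
```

The unregularized fit always achieves the lower training error; whether the regularized fit generalises better depends on the data and on alpha, which in practice you would tune on a validation set rather than fix by hand as done here.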

Here more info, which also touches upon the validation data set: https://community.deeplearning.ai/t/regular-math-s-vs-ml/250791/2

Hope that helps!

Best regards


I think the confusion is about when exactly we use regularization; let’s consider this from the point of view of the closed-form solution of Ordinary Least Squares (OLS) regression.

In order to calculate weights (coefficients) in OLS we can use Normal Equation as follows w = XTX_inv.dot(X.T).dot(y), where:
X: training matrix of Features
y: training vector of Target values
XTX: X.T.dot(X), (a.k.a Gram Matrix, where T represents Matrix Transpose)
XTX_inv: Inverse of Gram Matrix

We can now use these coefficients to make predictions using the testing set: X_test.dot(w)

Now let’s introduce regularization. Everything from above applies, except the Gram matrix is now:
XTX: X.T.dot(X) + alpha * np.eye(X.shape[1]), where alpha controls the strength of regularization (alpha = 0 recovers plain OLS)

So regularization takes place in the fitting stage (whether using the closed form or gradient descent), where the coefficients are calculated. Once we have the final weights, we can then proceed with predictions on the testing set.
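A runnable version of the sketch above (the toy data, true weights, and alpha value are invented for illustration):

```python
import numpy as np

# toy training data (made up for illustration): y = 2*x1 + 3*x2 exactly
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 3.0],
              [4.0, 1.0]])
y = X @ np.array([2.0, 3.0])

alpha = 0.1  # regularization strength; alpha = 0 recovers plain OLS

# regularized Gram matrix: X^T X + alpha * I -- this is the ONLY place alpha appears
XTX = X.T @ X + alpha * np.eye(X.shape[1])
w = np.linalg.inv(XTX) @ X.T @ y

# prediction on a (hypothetical) test point uses w only; no alpha involved
x_test = np.array([1.0, 1.0])
y_pred = x_test @ w
```

Note how alpha enters only while solving for `w`: the prediction step is identical with or without regularization, which is the thread's whole point in miniature.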