Trying out Model selection


I think week 3 is one of the best sets of lessons, as it answers so many of the questions I had while trying to learn about ML, and I am glad it has been covered.

I have gone through the week's lectures and exercises and wanted to try the ideas on my own implementation of linear regression. I used the diabetes dataset from sklearn and was able to train the model and get a nice history/cost graph going down. I am using

iterations = 2000
alpha = 0.01

However, when I calculate the training error and validation error, I get:

train: 998.2554455885964

validation: 330.0409308064131

To me these values are pretty high (which I take to mean the linear model has high bias). I then changed the linear model to a polynomial model using the PolynomialFeatures class from sklearn with degree 2, and I get slightly better results.

For x^2 the errors are

train: 909.9086780179686

validation: 314.8579584817676

This suggests there is high bias, but with polynomial features I get slightly better errors.

Now I tried x^5 for the polynomial degree, and I get
train: 9.139249180757097e+195
validation: 2.7066317016026545e+195

and my history/cost graph goes up, which I take to mean I now have to change alpha.

I do not understand why the history/cost graph goes wrong when I change the degree, or whether my assumption of a high-bias problem in this case is correct.

Any guidance is appreciated.

Happy to post the cost if that helps

I get the following

For linear the errors are

train: 998.2554455885964

validation: 330.0409308064131

For x^2 the errors are

train: 909.9086780179686

validation: 314.8579584817676

For x^3 the errors are

train: 761.6635303197003

validation: 357.26960453784125

For x^4 the errors are

train: 567.5263972928071

validation: 398.56303289812996

Does this mean I should choose x^2, since it has the lowest validation error? But what about the bias issue, or have I got it wrong?


It’s hard to recommend any of the models because their training error is higher than their validation error.
That says the models fit data they weren’t trained on better than the data they were trained on.
This raises a few questions, like:
Are there outliers in the training set?
Is the validation set an accurate representation of the training set?
Is our training methodology correct?

I would build new models using K-fold cross validation and look at both the training data and the validation data.

Also check out this StackExchange post for more details.

Hope this helps!

Hi Sam,

The following code is what I have developed and used. Can you spot any issues?

import numpy as np
import math
import matplotlib.pyplot as plt

# Regression problem
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

dataset = load_diabetes()
X =
y =
# Convert to non-linear
trans = PolynomialFeatures(degree=4, include_bias=False)
X = trans.fit_transform(X)

X_train, X_, y_train, y_ = train_test_split(X, y, test_size=0.40, random_state=1)
X_cv, X_test, y_cv, y_test = train_test_split(X_, y_, test_size=0.50, random_state=1)

def normalizeX(X):
    mu = np.mean(X, axis = 0)
    sigma = np.std(X, axis = 0)
    X = (X - mu) /sigma
    return X

def addBias(X):
    bias = np.ones((X.shape[0],1))
    X = np.hstack((bias, X))
    return X

X_train = addBias(X_train)
y_train = y_train.reshape((y_train.shape[0],1))
W = np.zeros((X_train.shape[1], 1))

def calculate_H(X, W):
    h = X @ W;
    return h

def calculate_cost(X, y, W):
    m = X.shape[0]
    err = (calculate_H(X, W) - y) ** 2
    return 1/ (2 * m) * np.sum(err)

def gradient_descent(X, y, W, iterations):
    m = X.shape[0]
    alpha = 0.01
    history = []
    for i in range(iterations):
        W = W - alpha * (1/m) * (np.transpose(X) @ (calculate_H(X, W) - y))
        cost = calculate_cost(X, y, W)
        history.append(cost)  # record the cost so the history can be plotted
        #print(f'Cost: {cost} for iterations: {i}')
    return (W, history)

iterations = 500
(final_W,history) = gradient_descent(X_train, y_train, W,  iterations)

def evaluate_model(yhat, y):
    m = X.shape[0]
    err = (yhat - y) ** 2
    return 1/ (2 * m) * np.sum(err)

def predict(X, W):
    return X @ W

yhat = predict(X_train, final_W)
train_error = evaluate_model(yhat, y_train)

X_cv = normalizeX(X_cv)
X_cv = addBias(X_cv)

yhat = predict(X_cv, final_W)
y_cv =  y_cv.reshape((y_cv.shape[0],1))
cv_error = evaluate_model(yhat, y_cv)

Sorry, what is meant by k-fold cross validation?

Hi @ftayyab,

Your code is basically fine, but you didn’t normalize your X_train. I made the following changes and it seemed to work well.

  1. Commented out these two lines. Let’s start from the simplest case.
# trans = PolynomialFeatures(degree=4, include_bias=False)
# X = trans.fit_transform(X)
  2. Added the 2 lines marked with >>>. Note that I intentionally normalize X_train with its own mean and std, so you need to keep a copy of the original X_train in order to reproduce the same mean and std when normalizing your cv and test datasets. (Or save the mean and std of X_train instead of making a backup copy of the whole X_train.)
>>> X_train_backup = X_train.copy()
>>> X_train = normalizeX(X_train)

X_train = addBias(X_train)
  3. Added the 6 lines marked with >>>. I added 2 other ways to find the optimal w so you can see whether your result is reasonable.
iterations = 1000
(final_W,history) = gradient_descent(X_train, y_train, W,  iterations)

>>> from sklearn.linear_model import LinearRegression
>>> w_reference_1 = LinearRegression(fit_intercept=False).fit(X_train, y_train).coef_.T
>>> w_reference_2 = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)
>>> print('final error (gradient descent)', calculate_cost(X_train, y_train, final_W))
>>> print('final error (sklearn LinearRegression)', calculate_cost(X_train, y_train, w_reference_1))
>>> print('final error (closed form solution)', calculate_cost(X_train, y_train, w_reference_2))

I didn’t go further down your notebook. After I had run the code with the changes applied, I got

final error (gradient descent) 1430.6458102118213
final error (sklearn LinearRegression) 1418.2963501003694
final error (closed form solution) 1418.2963501003694

And they are quite comparable. You may adjust your learning rate and number of iterations to get a better gradient descent error.

Now, if you are happy with these results for the simplest feature setting, you may move on to features of higher degree if needed.

Summary notes:

  1. Normalizing your training data lets gradient descent work better.
  2. Check your result against existing methods.
  3. Adjust your hyperparameters (e.g. learning rate and number of iterations) for gradient descent. Note that sklearn’s LinearRegression solves the least-squares problem directly, so it has no learning rate to tune; iterative estimators such as SGDRegressor do.
  4. Besides comparing the cost values among methods, it is good practice, in my opinion, to also look at the weights from each of them. I checked after running the changed code above, and they were quite similar.
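
As a rough illustration of note 1, here is a minimal sketch on synthetic data (not the diabetes dataset). The feature range, alpha = 0.01 and the 20 iterations are assumptions picked only to make the effect visible: with the same learning rate, gradient descent blows up on large-scale features and steadily reduces the cost once they are standardized.

```python
import numpy as np

def add_bias(X):
    # Prepend a column of ones, as addBias does in the thread
    return np.hstack((np.ones((X.shape[0], 1)), X))

def gradient_descent(X, y, alpha, iterations):
    # Batch gradient descent for linear regression, recording the cost each step
    m, n = X.shape
    W = np.zeros((n, 1))
    history = []
    for _ in range(iterations):
        W = W - alpha * (1 / m) * (X.T @ (X @ W - y))
        history.append(float(np.sum((X @ W - y) ** 2) / (2 * m)))
    return W, history

# Synthetic data (assumption): two features on a 0..100 scale
rng = np.random.default_rng(0)
raw = rng.uniform(0, 100, size=(200, 2))
y = 3.0 * raw[:, [0]] - 2.0 * raw[:, [1]] + rng.normal(0, 1, size=(200, 1))

# Standardize with the mean and std of the same (training) data
mu, sigma = raw.mean(axis=0), raw.std(axis=0)
scaled = (raw - mu) / sigma

_, hist_raw = gradient_descent(add_bias(raw), y, alpha=0.01, iterations=20)
_, hist_scaled = gradient_descent(add_bias(scaled), y, alpha=0.01, iterations=20)

print("raw features:    first cost", hist_raw[0], "last cost", hist_raw[-1])
print("scaled features: first cost", hist_scaled[0], "last cost", hist_scaled[-1])
```

The scaled run still needs more iterations (or a larger alpha) to converge fully; the point is only that its cost goes down while the unscaled run diverges.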

Additional side notes:

  1. It seems you don’t use X_test at all.
  2. I didn’t go further down your notebook, but make sure to normalize X_cv using the mean and the std from X_train_backup.
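
One possible way to do that, sketched with hypothetical helper names (fit_normalizer and apply_normalizer are not in the original notebook) and random stand-in data: compute the mean and std once from the training set, then reuse exactly those constants for the cv set.

```python
import numpy as np

def fit_normalizer(X_train):
    # Learn the normalization constants from the training data only
    mu = np.mean(X_train, axis=0)
    sigma = np.std(X_train, axis=0)
    return mu, sigma

def apply_normalizer(X, mu, sigma):
    # Apply previously learned constants to any dataset (train, cv or test)
    return (X - mu) / sigma

# Stand-in data (assumption), in place of the real X_train and X_cv
rng = np.random.default_rng(0)
X_train = rng.normal(10.0, 4.0, size=(60, 3))
X_cv = rng.normal(10.0, 4.0, size=(20, 3))

mu, sigma = fit_normalizer(X_train)
X_train_norm = apply_normalizer(X_train, mu, sigma)
X_cv_norm = apply_normalizer(X_cv, mu, sigma)  # same mu and sigma as the training set
```

Saving mu and sigma this way avoids keeping a backup copy of the whole X_train.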

Hello rmwkwok,

Thank you for the reply and for spending time reviewing the code. I think I made the mistake of not using the same mean and sigma for the validation dataset.

Although I would like to clarify something:

From the lesson it seems that prediction is made on train data and then the train error is calculated:

yhat = lmodel.predict(X_train)
err_train = lmodel.mse(y_train, yhat)

which is what I believe you are also doing in the lines you have added; however, I am not sure why the numbers I am getting are different. Any idea? I am using the evaluate_model method (which is the same as the calculate_cost function). Is there anything wrong with this function?

def evaluate_model(yhat, y):
    m = X.shape[0]
    err = (yhat - y) ** 2
    return 1/ (2 * m) * np.sum(err)

In your evaluate_model, there is this line m = X.shape[0], which refers to an X that’s not among the input variables, so I suppose Python will use the X defined outside of the function, which is X = trans.fit_transform(X) from before you split the data. So you are using an m that’s too large for X_train.
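
A minimal sketch of one way to avoid this, assuming y is a column vector with one row per example: take m from the arguments of the function instead of from a global. The small arrays below are made up just to check the arithmetic.

```python
import numpy as np

def evaluate_model(yhat, y):
    # Use the number of examples actually passed in, not a global X
    m = y.shape[0]
    err = (yhat - y) ** 2
    return 1 / (2 * m) * np.sum(err)

# Made-up example: squared errors are 0, 0 and 4, and m = 3
yhat = np.array([[1.0], [2.0], [3.0]])
y = np.array([[1.0], [2.0], [5.0]])
error = evaluate_model(yhat, y)  # 4 / (2 * 3)
```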

Excellent, that was it. Thanks for pointing it out. Now I am getting the same results.

For my clarification, if you can answer the following, I would really appreciate it:

  1. What is the use of the test dataset, in terms of the output it provides? How do we interpret the results?
  2. This might be simple, but I get a value of 1432.78084 as the error for the training and validation sets. Doesn’t that mean the error is too high in terms of accuracy, or am I understanding it wrong? Having a total error cost nearer to 0 would be ideal.

Sure @ftayyab!

This goes back to the topic - model selection, and K-Fold CV.

A more general definition of model training includes (1) model selection and (2) literally fitting your model candidates with data. (2) uses the training set, whereas (1) relies on the cv set. In this sense, both the training and cv sets can be seen as your training data, because you use both to inform your decisions in order to deliver your final, trained model.

The test set, however, represents data in the production stage, so it’s used to assess your final, trained model. Before the assessment, the test set is set aside: we don’t use it in model selection or model fitting. If the assessment turns out badly, we set it aside again: we don’t change the test set, so that the next assessment cannot improve merely because the test set changed. We hope to improve the assessment result by training a better model.

Let’s say you have a fixed cv set (which is the case in your code) and N model candidates, and you evaluate each candidate with that one cv set. You could call that 1-fold cross validation (it is more commonly known as hold-out validation). A K-fold CV then means you have K different cv sets. Here’s a way you can generate them:

  1. From your whole dataset, leave out 20% as the test set, and keep the remaining 80% as training data.

  2. For 5-fold CV, split your training data into 5 slices. Each time, pick one slice as the cv set and the rest as the training set. Train one of your model candidates with the training set and evaluate it with the cv set. Repeat this until every slice has served as the cv set. You then have 5 evaluation scores for this candidate, and you may average them to get one final score for the candidate.

  3. Repeat step 2 for all candidates.

  4. Pick the candidate with the best final score.

Your use of PolynomialFeatures creates the model candidates. Degree 1 is one candidate, degree 2 is another, and so on.
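
Steps 1 to 4 can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses plain NumPy on a synthetic single-feature dataset rather than the diabetes data, fits each candidate by least squares instead of gradient descent, and the helper names (poly_features, fit_and_score, k_fold_score) are made up for this example.

```python
import numpy as np

def poly_features(x, degree):
    # Columns x, x^2, ..., x^degree for a single feature of shape (n, 1)
    return np.hstack([x ** d for d in range(1, degree + 1)])

def fit_and_score(X_tr, y_tr, X_va, y_va):
    # Normalize with training-fold statistics, fit by least squares,
    # and return the cost 1/(2m) * sum of squared errors on the cv fold
    mu, sigma = X_tr.mean(axis=0), X_tr.std(axis=0)
    A_tr = np.hstack((np.ones((len(X_tr), 1)), (X_tr - mu) / sigma))
    A_va = np.hstack((np.ones((len(X_va), 1)), (X_va - mu) / sigma))
    W, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.sum((A_va @ W - y_va) ** 2) / (2 * len(y_va)))

def k_fold_score(X, y, k, seed=0):
    # Step 2: each slice serves once as the cv set; average the k scores
    folds = np.array_split(np.random.default_rng(seed).permutation(len(X)), k)
    scores = []
    for i in range(k):
        va = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(fit_and_score(X[tr], y[tr], X[va], y[va]))
    return float(np.mean(scores))

# Synthetic data (assumption): a noisy quadratic in one feature
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=(150, 1))
y = 1.0 + 2.0 * x - 1.5 * x ** 2 + rng.normal(0, 0.5, size=(150, 1))

# Step 1: hold out 20% as the test set; it is not touched during selection
perm = rng.permutation(len(x))
test_idx, train_idx = perm[:30], perm[30:]

# Steps 2 and 3: one averaged 5-fold score per candidate degree
cv_scores = {
    d: k_fold_score(poly_features(x[train_idx], d), y[train_idx], k=5)
    for d in range(1, 6)
}

# Step 4: pick the candidate with the lowest averaged cv cost
best_degree = min(cv_scores, key=cv_scores.get)
print(cv_scores, "best degree:", best_degree)
```

Note that the normalization constants are recomputed inside every fold from that fold's training part only, so the cv slice never influences them.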



Indeed. 1432 is about the best you can get given the current model assumption, that is, without polynomial features. So you may try other assumptions, such as using polynomial features, or using decision trees instead of linear regression, or any other approach.

When you have some candidates in mind, do try the K-Fold CV approach to make your final decision :slight_smile:

I would like to respond to this in a separate reply.

It’s not realistic to expect zero training/cv error unless your data is ideal (without noise) and your model assumption is perfect, which is very unlikely.

Real world data always has random noise, and we don’t try to model that noise and predict it. Also, our dataset is sometimes insufficient to explain the variable we are predicting, because our ability to collect all the data that drives that variable is limited.

On the other hand, knowing the perfect model assumption means we are completely on top of the process we are trying to model, which is true in an artificial, computer-simulated world but not in our lovely real world.

Therefore, having errors is reasonable and unavoidable, and their size depends on the quality of our data and of our model assumption. At least we can try different model assumptions and use the one with the lowest cv error!

Thanks for the detailed response. It is all great and makes sense. True, in the real world we won’t have the perfect dataset or perfect models. Does this also mean a model designed to detect digits, let’s say on a credit card, might fail to read the digits correctly 5 out of 100 times? Is that a possibility, or can we be sure that the digit recognition will work correctly 100% of the time?

Also assume that the surroundings (environmental conditions) remain the same.

Indeed, this is how we try to make sure the model works almost 100% of the time: by controlling the environment. The only thing you can’t control is probably the credit card itself. What happens if the card has some dirt or scratches on it, or the digits are somehow damaged?

Dirt, scratches and damage are examples of noise you can’t control and can’t always model. They are not always over the digits; they occur by chance; they are not regular; they are not features that help you make better predictions.

@rmwkwok -
This is a great response, I hope others stumble across this answer like I did.

I was going to write a separate question, but I will try to fit it in here. I am hoping for some clarification regarding the K-fold cross validation process and what happens afterwards. Basically, how do we proceed after step 4? Below is the setup/background for the questions so you can see exactly what I mean, followed by the actual questions.

Question setup:
Background of the hypothetical problem: Let’s assume we have a NN model and want to tune lambda only. We choose 3 values to investigate and do a 5-fold CV as you explain in your steps 1 through 4, so we end up with 15 models/sets of weights. Let’s also assume that for each split we normalize the data (as you have verified for me in other questions and in the course) by finding the mean and standard deviation (std) of the training data, using them to normalize the training data, and then using the same mean and std to normalize the CV data. After all of this, each of the three lambdas will have 5 error scores, one per fold, which we can average for each lambda. This is your step 4, which all makes sense.

Let’s say a lambda of 0.02 was chosen as the lambda with the lowest CV error score. Now let’s assume that we have a separate set of data that is the true test data and was never seen during training or CV hyperparameter tuning. The general idea is that we will use the lambda and model that gave us the lowest CV error score, which sounds great and easy, but I’m having trouble with the details.

Question 1: Which data set do we use to normalize the test data?
For each of the 15 models we ran, we had differently normalized training/cv data that relied on different means and standard deviations (since each fold has a different training data set). So, for the separate test data, which mean and standard deviation do we use to normalize it? I’ve read and confirmed that we are supposed to use the training mean and standard deviation, but there are 5 of them for the final chosen lambda of 0.02. Should I average the means and standard deviations of the 5 training sets, or do something else?

Question 2: Which “model” (learned weights) do we use on the test data?
After we normalize the test data set (somehow, see Q1), which model do we use on the test data? I know we will use the model with the chosen lambda value (0.02), but there will be 5 models from the train/CV hyperparameter tuning process. When I say “model” I mean a set of weights learned on one of the 5 training data sets for a lambda of 0.02. A guess: maybe we use the lambda (plus other parameters) to learn the weights on the test data itself and then use those weights, along with the correct hyperparameters, in the .predict(test_data_here) call on the test data? Or is there some other way? Also, what happens if your separate test set is just one row, i.e., one case with the appropriate inputs? Using that to learn weights wouldn’t make any sense.
Another way to ask this question: can we use a previously found model (from the train/cv data) and its learned weights (and just skip to the .predict part), or do we need to use the same parameters (lambda etc.) and find new weights on the test data (hoping that our test data is more than one data point)?

Question 3: What does the code look like and is there anything beyond using .predict and finding the error on the test data?
Let’s say we have chosen a way to normalize the test data, and a way to choose which learned weights/model to use on the test data. Then how do we actually move from there? I’m assuming it looks something like this pseudo-code?

→ Load separate/unseen test data
→ Normalize data somehow (Question 1)
→,y_test?) # May not be necessary if you choose a model from the train/cv step (Q2 related)
→ model[model_number_here].predict(test_data_here) (Q2 related)
→ calculate error on test data

Hopefully these questions make sense. I can move this to a separate topic too if needed. Thanks!

Hello @naveadjensen!

Great question!

From your question, you know very clearly that,

  1. one model is associated with one set of normalization constants
  2. the best lambda is associated with 5 models in a 5-fold CV.
  3. but we need one model for the test data

There are multiple ways to continue after step 4 in this post.

Possible way A

step 5: use all the 80% training data to find a set of normalization constants and train a new model
step 6: given a test set of any size (or the held-out test set from step 1), normalize it with the constants in step 5, and make prediction with the model in step 5.
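
A rough sketch of possible way A, under loud assumptions: closed-form ridge regression stands in for the NN from the question, the data is synthetic, best_lambda = 0.02 is simply the hypothetical value from the question (not a computed result), and fit_ridge and predict are helper names invented for this example.

```python
import numpy as np

def add_bias(X):
    return np.hstack((np.ones((X.shape[0], 1)), X))

def fit_ridge(X, y, lam):
    # Closed-form ridge regression; the bias weight is not regularized
    A = add_bias(X)
    R = lam * np.eye(A.shape[1])
    R[0, 0] = 0.0
    return np.linalg.solve(A.T @ A + R, A.T @ y)

def predict(X, mu, sigma, W):
    # Always normalize new data with the constants saved in step 5
    return add_bias((X - mu) / sigma) @ W

# Synthetic data (assumption): 100 rows, first 80 are the training data
rng = np.random.default_rng(0)
X = rng.normal(5.0, 3.0, size=(100, 3))
y = X @ np.array([[1.0], [-2.0], [0.5]]) + rng.normal(0.0, 0.5, size=(100, 1))
X_train_all, y_train_all = X[:80], y[:80]   # the whole 80%, no cv split
X_test, y_test = X[80:], y[80:]             # the held-out 20%

best_lambda = 0.02  # assumed outcome of the K-fold CV

# Step 5: one set of normalization constants and one model from all 80%
mu, sigma = X_train_all.mean(axis=0), X_train_all.std(axis=0)
W = fit_ridge((X_train_all - mu) / sigma, y_train_all, best_lambda)

# Step 6: normalize the test set with those constants and predict
yhat_test = predict(X_test, mu, sigma, W)
test_error = float(np.sum((yhat_test - y_test) ** 2) / (2 * len(y_test)))

# The same call works for a test set of a single row
one_row_prediction = predict(X_test[:1], mu, sigma, W)
```

Nothing is ever fitted on the test data here, which is why a one-row test set is not a problem.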

Possible way B

step 5: in training a GBDT or a NN, you may want to use their “early stopping” features. In that case, split the 80% training data into a “training set” and a “validation set”, only this time we don’t need to do it 5 times but just once. Find the normalization constants using only the training set, and use them to normalize both the training set and the validation set. Train a new model by giving both the training set and the validation set to the algorithm that supports “early stopping”, and you will end up with one set of normalization constants and one model.

step 6: given a test set of any size (or the held-out test set from step 1), normalize it with the constants in step 5, and make prediction with the model in step 5.
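
To give a feel for possible way B, here is a minimal early-stopping sketch. It is an illustration only: it uses gradient descent on a linear model with synthetic data rather than a GBDT or NN library, and the function name train_with_early_stopping, the patience of 20 and the other settings are assumptions made for this example.

```python
import numpy as np

def add_bias(X):
    return np.hstack((np.ones((X.shape[0], 1)), X))

def cost(A, y, W):
    return float(np.sum((A @ W - y) ** 2) / (2 * A.shape[0]))

def train_with_early_stopping(A_tr, y_tr, A_va, y_va,
                              alpha=0.05, max_iters=2000, patience=20):
    # Keep the weights with the best validation cost; stop once the
    # validation cost has not improved for `patience` iterations in a row
    W = np.zeros((A_tr.shape[1], 1))
    best_W, best_cost, waited = W.copy(), cost(A_va, y_va, W), 0
    for _ in range(max_iters):
        W = W - alpha * (A_tr.T @ (A_tr @ W - y_tr)) / A_tr.shape[0]
        va_cost = cost(A_va, y_va, W)
        if va_cost < best_cost:
            best_W, best_cost, waited = W.copy(), va_cost, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_W, best_cost

# Synthetic stand-in (assumption) for the 80% training data
rng = np.random.default_rng(0)
X = rng.normal(5.0, 3.0, size=(80, 3))
y = X @ np.array([[1.0], [-2.0], [0.5]]) + rng.normal(0.0, 0.5, size=(80, 1))

# Split once into a training set and a validation set
X_tr, y_tr, X_va, y_va = X[:64], y[:64], X[64:], y[64:]

# Normalization constants come from the training set only
mu, sigma = X_tr.mean(axis=0), X_tr.std(axis=0)
A_tr = add_bias((X_tr - mu) / sigma)
A_va = add_bias((X_va - mu) / sigma)

best_W, best_cost = train_with_early_stopping(A_tr, y_tr, A_va, y_va)
```

The outcome is one set of constants (mu, sigma) and one model (best_W), which is what step 6 needs.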


@rmwkwok -
Thanks as always.
Can you please verify the following two statements for me to make sure I fully understand:

1.) When you say: “use all the 80% training data…”, are you referring to the 80% (training/cv) data that we are using and splitting on during the K-fold hyper parameter tuning part of the process? (except when calculating the mean and Std for use in the 20% test set, we use the entire 80% and don’t do a training/cv split).

2.) By “normalization constants”, do you mean the mean and standard deviation that are calculated by the entire 80% training data (without any training/cv splits)?

If I understand what you are saying, and both answers are “yes” then Possible way A makes sense.

I will have to think about Possible way B a little more and look more into the early stopping option, sounds interesting from what I’ve read so far.


Hello @naveadjensen,

The 80% refers to the training data from step 1 of my previous reply.

This is the case described in possible way A:

That’s a good idea!