Model Selection based on CV or Test & Diff b/w CV and Test data

Tayyab_Shafiq · September 22, 2023, 7:03am

I think model selection should be done after the test error is computed, because our essential goal is the lowest generalization error, and that is only known once the test error has been calculated. If we select a model based on the cross-validation set and choose the one with the lowest CV error, some of the other models might still do better on the test data, i.e. have a lower generalization error.

My other question is: what is the main difference between CV and test data? We do the same thing with both, which is only prediction on the data, not training …

Deepti_Prasad · September 22, 2023, 7:28am

Hello,

The cross validation data is used during the training phase of the model to provide an unbiased evaluation of the model’s performance and to fine-tune the model’s parameters.

The test data, by contrast, is used after the model has been fully trained to assess its performance on completely unseen data.

Regards
DP

Tayyab_Shafiq · September 22, 2023, 12:21pm

In the model evaluation and selection lab, we only train on the training data: we call model.fit() and model.predict() on the training data, but never model.fit() on the CV data. For the CV and test data we only call model.predict(). So why do you say the CV data is used during the training phase?
Please help clarify this for me. Thanks.

Deepti_Prasad · September 22, 2023, 1:11pm

If you have done the TensorFlow course, you will come across assignments or notebooks where the training and validation sets are defined first, then a model architecture is built and trained. Basically, the cross-validation dataset is used to arrive at the best model, so that this model can then be evaluated on the test dataset.

So the cross-validation data is used during the training phase of the model to provide an unbiased evaluation of the model’s performance and to fine-tune the model’s parameters.

Tayyab_Shafiq · September 22, 2023, 2:21pm

Look at this code:

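import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

train_mse_list = []
cv_mse_list = []
models = []
scalers = []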

for degree in range(1, 11):

    # ----- Training data -----

    # Polynomial features
    poly_train = PolynomialFeatures(degree, include_bias=False)
    X_train_poly = poly_train.fit_transform(X_train)

    # Feature scaling (the scaler is fitted on the training data only)
    scaler = StandardScaler()
    X_train_poly_scaled = scaler.fit_transform(X_train_poly)
    scalers.append(scaler)

    # Linear regression (the only place where fit() is called on a model)
    linear_model = LinearRegression()
    linear_model.fit(X_train_poly_scaled, Y_train)
    models.append(linear_model)

    # Predictions and training error
    Y_hat = linear_model.predict(X_train_poly_scaled)
    train_mse = mean_squared_error(Y_train, Y_hat) / 2
    train_mse_list.append(train_mse)

    # ----- Cross-validation data -----

    # Polynomial features
    poly_cv = PolynomialFeatures(degree, include_bias=False)
    X_cv_poly = poly_cv.fit_transform(X_cv)

    # Feature scaling (reuse the scaler fitted on the training data)
    X_cv_poly_scaled = scaler.transform(X_cv_poly)

    # Predict only (no fitting), then compute the CV error
    Y_hat = linear_model.predict(X_cv_poly_scaled)
    cv_mse = mean_squared_error(Y_cv, Y_hat) / 2
    cv_mse_list.append(cv_mse)

print(f"Training MSEs: {train_mse_list}")
print(f"Cross-Validation MSEs: {cv_mse_list}")
#print(f"W: {w_list}")
#print(f"b: {b_list}")

# Finding the lowest CV MSE
# Get the model with the lowest CV MSE (add 1 because list indices start at 0).
# This also corresponds to the degree of the polynomial added.
degree = np.argmin(cv_mse_list) + 1  # returns 4 here (index 3 + 1)
print(f"The Lowest CV MSE found at Degree: {degree}")

# Publish the generalization error using the test set
poly_test = PolynomialFeatures(degree, include_bias=False)
X_test_poly = poly_test.fit_transform(X_test)

# Reuse the scaler and model that were fitted for the selected degree
X_test_poly_scaled = scalers[degree-1].transform(X_test_poly)

Y_hat = models[degree-1].predict(X_test_poly_scaled)

# Divide by 2 to match the training and CV errors computed above
test_mse = mean_squared_error(Y_test, Y_hat) / 2

print(f"Training MSE: {train_mse_list[degree-1]:.2f}")
print(f"Cross Validation MSE: {cv_mse_list[degree-1]:.2f}")
print(f"Test MSE: {test_mse:.2f}")

Deepti_Prasad · September 22, 2023, 3:12pm

Where is this code from?

Tayyab_Shafiq · September 22, 2023, 4:08pm

Machine Learning Specialization, Advanced Learning Algorithms course, week 3, model evaluation and selection lab.

Deepti_Prasad · September 22, 2023, 4:12pm

If this programming assignment is a graded assignment, please remove the code from your post.

May I know what you are trying to explain with this code?

TMosh · September 22, 2023, 6:48pm

The training set is used to train the model.
Some parameters of the model are optimized by using the model’s performance on the validation set (e.g. adjusting a regularization parameter).
The performance of the completed system is measured just once, using a test set.

If the test set performance is not sufficient, then you go back to the beginning and try a more complicated model.
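The three sets mentioned above can be produced with two calls to scikit-learn's `train_test_split`. This is only a minimal sketch: the synthetic data, the 60/20/20 proportions, and the variable names are assumptions for illustration, not something taken from the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data, purely for illustration (an assumption, not the lab's data).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 3 - X[:, 0] + rng.normal(0, 1, size=200)

# First hold out 40% of the data, then split that part half-and-half
# into a cross-validation set and a test set (60/20/20 overall).
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=1
)
X_cv, X_test, y_cv, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=1
)

print(X_train.shape, X_cv.shape, X_test.shape)
```

Only `X_train` is ever passed to `fit()`. `X_cv` is used to compare candidate models, and `X_test` is set aside until a single model has been chosen.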

Tayyab_Shafiq · September 26, 2023, 5:40am

This is an optional lab.
I mean we do the same thing with the CV and test data, which is only prediction. So why do we need both a CV set and a test set? Couldn't we use just one set for selection?

Deepti_Prasad · September 26, 2023, 6:02am

That is a good question, and I know this confusion about the cross-validation and test datasets comes up a lot. But if you think both datasets are used only to get predictions, that understanding is a bit incomplete.

Cross-validation data is basically used to check how good your model really is relative to the training data, whereas the test data checks how the model that was chosen using the CV data performs.

CV data gives an unbiased evaluation of the model’s performance and is used to fine-tune the model’s parameters, whereas the test dataset is used after the model has been fully trained (fully trained here means it has already been evaluated against the CV data) to assess the model’s performance on completely unseen data.

Just imagine, in general terms, that you prefer buying a quality product like an iPhone from an Apple showroom rather than from an ordinary shop. Why? Because you would expect the people at the Apple showroom to know more about what they are selling, so you would get a quality product.

CV data plays a similar role for the model trained on the training dataset, compared with the test dataset. As Tom mentioned, with the CV data we optimise the model to perform as well as it can, whereas the test dataset is applied to the model that was trained on the training data and already checked against the CV data, to confirm that the resulting model is robust.

So CV data is like a food inspector for a restaurant and the test dataset is like the customers: both are basically tasting and testing the restaurant's food.

I have given you two examples; I hope it is clear now.

Regards
DP

Tayyab_Shafiq · September 26, 2023, 6:07am

I understood, Thanks

TMosh · September 26, 2023, 3:44pm

We make the same measurement, but the data sets have been handled differently. The validation set was used to optimize the model. The test set is brand-new and never touched the training or optimization process. So the test set simulates how the completed model will work on practically new data.
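A small simulation may help show why the error on the set used for selection cannot also serve as the generalization estimate. In this sketch every "model" is a hypothetical random guesser with a true error rate of 0.5, and the set sizes and number of models are assumptions chosen for illustration. Picking the guesser with the lowest validation error makes that validation error look better than it really is, while the untouched test set still reports roughly 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)

n_models = 50      # hypothetical number of candidate models
n_cv = 100         # size of the set used to pick a model
n_test = 10_000    # size of the untouched test set

# Binary labels for both sets.
y_cv = rng.integers(0, 2, size=n_cv)
y_test = rng.integers(0, 2, size=n_test)

# Every "model" just guesses at random, so its true error rate is 0.5.
cv_preds = rng.integers(0, 2, size=(n_models, n_cv))
test_preds = rng.integers(0, 2, size=(n_models, n_test))

# Error rate of each model on each set.
cv_err = (cv_preds != y_cv).mean(axis=1)
test_err = (test_preds != y_test).mean(axis=1)

# Select the model that looks best on the validation set.
best = int(np.argmin(cv_err))
best_cv_err = cv_err[best]
best_test_err = test_err[best]

print(f"Selected model, error on the selection set: {best_cv_err:.3f}")
print(f"Selected model, error on the test set:      {best_test_err:.3f}")
```

The same effect would appear if the test set were used for selection instead: whichever set does the choosing stops being an honest measure of new-data performance, which is why a third set is kept aside.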

JedX · December 6, 2023, 7:09am

Is CV data used later in the process to adjust model weights as well - fine tune?

If it’s never used that way, would it be ok to use just Test-set?

TMosh · December 6, 2023, 7:15am

Do not use the CV data for training.
You need three sets:

One for training.
One for adjusting the model parameters (regularization, etc).
One for testing the completed model.
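As a rough sketch of how the three sets play different roles, the example below tunes a regularization strength with the CV set and touches the test set only once. The synthetic data, the use of `Ridge`, and the candidate `alphas` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
true_w = np.array([1.5, -2.0, 0.0, 0.5, 3.0])
y = X @ true_w + rng.normal(0, 0.5, size=300)

# 60/20/20 split into training, cross-validation, and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0
)
X_cv, X_test, y_cv, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0
)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]  # hypothetical candidate values
models = []
cv_mse = []
for alpha in alphas:
    model = Ridge(alpha=alpha)
    model.fit(X_train, y_train)  # weights are learned from the training set only
    cv_mse.append(mean_squared_error(y_cv, model.predict(X_cv)))  # CV set: predict only
    models.append(model)

# The regularization strength is chosen by comparing CV errors.
best = int(np.argmin(cv_mse))

# The test set is used once, on the chosen model, to report the final error.
test_mse = mean_squared_error(y_test, models[best].predict(X_test))

print(f"Chosen alpha: {alphas[best]}")
print(f"CV MSE: {cv_mse[best]:.3f}, Test MSE: {test_mse:.3f}")
```

Note that `fit()` is never called with the CV or test data: the CV set only decides which already-trained model to keep.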

JedX · December 6, 2023, 8:07am

What’s the difference between training and adjusting model parameters? Do you mean hyper-parameters that you set manually or via some GridSearch etc. ?

Deepti_Prasad · December 6, 2023, 9:37am

Hello JedX,

As Tom mentioned, we are not supposed to use CV data for training, so no, we do not use CV data to adjust (fine-tune) the model weights.

The answer to your second question is again no, as the training data is what is used to adjust the model weights.

Model weights are adjusted using the training data; hyperparameters are chosen by comparing the resulting models on the CV data.

Training data is the subset of the dataset used to build predictive models.
Validation (CV) data is the subset of the dataset used to assess the performance of the models built in the training phase.
Test data, or unseen examples, is the subset of the dataset used to assess the likely future performance of a model (this model is created using the training data).

Hope it clears your doubt!!!

Regards
DP

TMosh · December 6, 2023, 2:43pm

Yes.
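For the grid-search case raised in the question, scikit-learn's `GridSearchCV` automates the hyperparameter part. This is a hedged sketch with assumed synthetic data and an assumed `alpha` grid. Note that `GridSearchCV` uses k-fold cross-validation inside the development data rather than a single fixed CV set, which is a different technique from the single held-out CV set discussed in this thread.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (an assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
true_w = np.array([1.5, -2.0, 0.0, 0.5, 3.0])
y = X @ true_w + rng.normal(0, 0.5, size=300)

# Keep a test set completely outside the search.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameter: alpha, chosen by the search via 5-fold cross-validation.
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},  # hypothetical grid
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_dev, y_dev)

# Weights: learned by fit() for the chosen alpha (refit on all of X_dev).
print("Chosen hyperparameter:", search.best_params_)
print("Learned weights:", search.best_estimator_.coef_)

# score() returns the negated MSE here, so flip the sign for the test MSE.
test_mse = -search.score(X_test, y_test)
print(f"Test MSE: {test_mse:.3f}")
```

The weights in `coef_` are what training adjusts, while `alpha` is set from outside the training loop, either manually or by a search like this one.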
