Trying out Model selection

Sure @ftayyab!

This goes back to the topic - model selection, and K-Fold CV.

A more general definition of model training includes (1) model selection and (2) actually fitting your model candidates to data. (2) uses the training set, whereas (1) relies on the cv set. In this sense, both the training and cv sets can be seen as your training data, because you use both to inform your decisions on the way to delivering your final, trained model.

The test set, however, represents data from the production stage, so it is only used to assess your final, trained model. Before that assessment, the test set is forgotten - we don’t use it in model selection or model fitting. If the assessment result is bad, we forget it again - we don’t swap in a different test set, to avoid the possibility that the next assessment improves merely because the test set changed. We hope to improve the result by training a better model.
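To make the three roles concrete, here is a minimal sketch of carving out the three sets with scikit-learn. The toy `X`, `y`, and the 60/20/20 split sizes are my own assumptions, not from your code:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # toy features (assumed shapes)
y = rng.normal(size=100)        # toy targets

# First carve out the test set; it is then "forgotten" until the final assessment.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Split the remaining 80% into a fitting set and a fixed cv set.
# 0.25 of the remaining 80 samples gives a 60/20 split.
X_fit, X_cv, y_fit, y_cv = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)

print(len(X_fit), len(X_cv), len(X_test))  # → 60 20 20
```

The fitting set trains each candidate, the cv set picks among them, and the test set is touched exactly once at the end.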

Let’s say you have a fixed cv set (which is the case in your code), you have N model candidates, and you evaluate each candidate with that one cv set - you can think of this as 1-fold cross validation. K-Fold CV, then, means you have K different cv sets. Here’s a way to generate them:

  1. From your whole dataset, leave out 20% as the test set, and keep the remaining 80% as training data.

  2. For 5-fold CV, split your training data into 5 slices. Each time, pick one slice as the cv set and the rest as the training set. Train one of your model candidates on the training set and evaluate it on the cv set. Repeat until every slice has served as the cv set. You then have 5 evaluation scores for this candidate, and you may average them into one final score for the candidate.

  3. Repeat step 2 for all candidates.

  4. Pick the candidate with the best final score.
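The loop in step 2 can be sketched with scikit-learn’s `KFold`. This is just an illustration with made-up data and a single `LinearRegression` candidate, assuming mean squared error as the evaluation score:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))               # toy training data
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=80)  # noisy quadratic target

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, cv_idx in kf.split(X):
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])       # fit on the other 4 slices
    pred = model.predict(X[cv_idx])             # evaluate on the held-out slice
    scores.append(mean_squared_error(y[cv_idx], pred))

final_score = np.mean(scores)  # one averaged score for this candidate
print(len(scores))             # → 5
```

Running this once per candidate and comparing the averaged `final_score` values is exactly steps 3 and 4.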

Generating polynomial features is what creates your model candidates: degree 1 is one candidate, degree 2 is another, and so on.
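Putting it all together, a hedged sketch of comparing degree candidates with 5-fold CV might look like the following. The cubic toy data and the degree range are my assumptions; `cross_val_score` handles the fold loop internally:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(120, 1))
y = X[:, 0] ** 3 - 2 * X[:, 0] + rng.normal(scale=1.0, size=120)

degrees = [1, 2, 3, 4, 5]          # each degree is one model candidate
final_scores = {}
for d in degrees:
    candidate = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    # 5-fold CV: average the per-fold MSE into one final score per candidate
    mse = -cross_val_score(candidate, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    final_scores[d] = mse

best_degree = min(final_scores, key=final_scores.get)
print(best_degree)
```

The candidate with the lowest averaged MSE wins; only that winner should then be retrained and assessed once on the held-out test set.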