C2_W3_Lab 1_Model Evaluation & Selection (Scikit)

[Context: sklearn.preprocessing.PolynomialFeatures Class]

In this lab, why did we use the transform() method when creating a single model,
but fit_transform() when creating multiple models,
i.e. NOT reusing the polynomial rules learnt from the training data?

(Assuming that mentors have access to the labs & corresponding codes. Hence did not copy the entire core cell.)

1 Like

Hello @Debatreyo_Roy,

It is a good question. The most intuitive approach is to fit with the training data, then use the fitted PolynomialFeatures object to transform the cv data, and I can assure you that this approach is correct and is always the one to keep in mind.

However, in the case you mentioned, the only reason we are ALLOWED to do it differently is that PolynomialFeatures actually learns nothing from the training data. Nothing. Yes, the name fit suggests that something is to be learnt, but think carefully: if we set degree=2, do we really need to learn anything from the training data in order to square each of the features? No. Setting degree=2 is by itself enough for PolynomialFeatures to get the job done: it squares each of the features AND it multiplies every possible pair of features together. Therefore, it does not actually learn anything, and that is why, regardless of whether we have processed the training set before the cv set, we will get the same transformed cv set at the end. You can verify that yourself. Try it. :wink:
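A minimal sketch of that verification (the variable names and toy data here are illustrative, not from the lab):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy train/cv split, just to illustrate the point.
X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_cv = np.array([[7.0, 8.0], [9.0, 10.0]])

# Approach 1: fit on the training set, then transform the cv set.
poly = PolynomialFeatures(degree=2, include_bias=False)
poly.fit(X_train)
cv_via_train_fit = poly.transform(X_cv)

# Approach 2: fit_transform the cv set directly.
cv_direct = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_cv)

# Identical, because fit() learns no statistics from the data.
print(np.array_equal(cv_via_train_fit, cv_direct))  # True
```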


1 Like

On second thought, I missed a necessary condition for us to be allowed to fit_transform the cv set directly: we need to be sure that the training set and the cv set have the same structure in terms of features.

It seems that PolynomialFeatures remembers the number of features and their names when it is fit to a training set, and on transform it appears (verification needed) to perform a sanity check on the incoming dataset.

This means that if we fit it to a dataset of 4 features, then mistakenly transform a dataset that (somehow) has 5 features, it will raise an exception.
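A quick sketch of that behaviour (the shapes here are arbitrary, chosen only to mirror the 4-vs-5 example):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
poly.fit(np.zeros((10, 4)))  # fitted on a dataset with 4 features

# Transforming a dataset with a different number of features fails the check.
try:
    poly.transform(np.zeros((10, 5)))
except ValueError as err:
    print("Sanity check triggered:", err)
```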

What I am saying is: if there is anything that PolynomialFeatures learns from the training set, it is the number of features and the feature names, and these serve only as sanity checks.

As far as the lab is concerned, we should not have any problem since the training set and the cv set have the same feature structure.

Anyway, even though it is okay to fit_transform the cv set with PolynomialFeatures there, I recommend you stick with fitting on the training set and then transforming the cv set.

1 Like

I am considering filing a ticket for the course team to change that, but it is important that you know the rationale.

1 Like

PolynomialFeatures object —>

What exactly does it learn from the original data?
My understanding is that it learns the input features and, based on the degree given as an
argument, it simply creates a polynomial of that degree from the learnt input features.
Since the original data (the input features) is not changing, creating each new model with
a different degree does not need to learn anything; it simply keeps combining the learnt
input features to make a polynomial of that degree.

Is this understanding correct? :thinking:

1 Like

Thank you. All these detailed explanations & reasonings you keep providing (for each query that I have posted till now) help a lot.

Also, if you can point to some resource/course for learning the Scikit-learn library, along with exercises/projects for practising it simultaneously, that would be helpful.

1 Like

Hello @Debatreyo_Roy, I believe your “learns the input features” means my “remember the number of features and (if provided) the feature names”, and if I am right, then I agree.

As long as both the training set and the cv set have the same set of features and have the features ordered in the same way, it is fine to fit_transform the cv set and the training set separately. However, my recommendation will always be to transform the cv set with the object fitted to the training set, because that is the better practice.
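A sketch of why the feature order matters (toy values, not from the lab): the same row with its two features swapped produces a differently ordered polynomial expansion, so a model fitted on one ordering would misread features transformed in the other.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

row = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly.fit_transform(row)                   # columns: x1, x2, x1^2, x1*x2, x2^2
expanded_swapped = poly.fit_transform(row[:, ::-1])  # same values, reversed feature order

print(expanded)          # [[2. 3. 4. 6. 9.]]
print(expanded_swapped)  # [[3. 2. 9. 6. 4.]]
```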

I think a good project is to just work on any dataset you find on the internet and use sklearn whenever possible.

You might wonder how you would know when you can use sklearn, simply because you are not yet familiar with it.

The answer is simple: go through MLS Courses 1 & 2, write down all the skills you can apply, order them in sequence based on the roles you have learnt for them, and then for each skill, google sklearn <skill name> to see whether sklearn can help you with it. After finishing the project, share a presentable notebook and ask for comments on where and how you can make better use of sklearn.

Obviously my suggestion above is not like a dedicated course, and if such a course exists I am sure you can google it. Instead, it is a completely doable process requiring you to review the lectures, do a lot of thinking, googling, reading, and trials, and to organize things yourself. Compared to a course prepared by someone else, it will take you more time, but your effort will pay off, and if you share presentable notebooks for the community to review, you are not alone.


1 Like

A dedicated course saves time and gives a straight path, but going through my process gives the ups and downs which let the skills sink in better and can broaden one's horizons.

1 Like

Hi Raymond,
I must admit, this lab is hard. :sweat: But I can always learn something new by reading through your responses to other learners' questions :slight_smile:
I have a different question regarding classification in this lab: is the reason we need to convert y into 2D here because of matrix multiplication? Isn't that about the shapes of W and X; what does it have to do with the shape of y? And why is the shape of y still (200, 1)?

Many thanks

1 Like

Hello Christina @Christina_Fan,

Yes, you are right that it has nothing to do with X, so I think we can set the X part aside and focus only on y.

Since it said "will require it", I think the easiest way is to search for y_bc and any other variable derived from it, and see what actually requires a 2D y.

The search is easy with Chrome's Ctrl+F, and maybe you will find the line that requires it, but I trust you know how to verify it and find other cases? :wink:


1 Like

Thank you for the hints, Raymond. I can see both yhat and y_bc_train are 1D arrays, which can go through the np.mean() calculation. So I am still unsure why it required a 2D y in the first place? :face_with_spiral_eyes:

1 Like

Nope, Christina @Christina_Fan, check the following screenshot out:

I need you to do checks like this :wink:

You may also comment out the code line that converts y into 2D and see whether there is any error. I mean, these "print-check" and "screwing-up-the-code" habits are the best way to come up with your own ideas.
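For instance, a print-check like this (a sketch; y_bc here is just a stand-in array, not the lab's variable) shows that a (200, 1) array really is 2D:

```python
import numpy as np

y_bc = np.arange(200)                    # 1D: shape (200,), ndim 1
y_bc_2d = np.expand_dims(y_bc, axis=1)   # 2D: shape (200, 1), ndim 2

print(y_bc.shape, y_bc.ndim)        # (200,) 1
print(y_bc_2d.shape, y_bc_2d.ndim)  # (200, 1) 2
```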


1 Like

Thank you for being so patient with me. So interesting, I also did those print-checks, even before and after converting y_bc. I literally did not pick up that (200, 1) is a 2D array, but now I get it. :yum:


Cool! So you had tried both! Then you would have seen that, without converting y_bc, yhat != y_bc_cv could still run through without error, but the result was different.
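A sketch of the kind of silent mismatch being described here (toy arrays, not the lab's variables): comparing a 1D prediction array against a 2D label array still runs, but broadcasting turns the elementwise comparison into a square matrix, so the computed error rate is wrong.

```python
import numpy as np

yhat = np.array([0, 1, 1, 0])  # predictions, shape (4,)
y_1d = np.array([0, 1, 0, 0])  # labels as 1D, shape (4,)
y_2d = y_1d.reshape(-1, 1)     # labels as 2D, shape (4, 1)

# Matching shapes: elementwise comparison, the intended error rate.
print((yhat != y_1d).shape, np.mean(yhat != y_1d))  # (4,) 0.25

# Mismatched shapes: (4,) vs (4, 1) broadcasts to (4, 4) -- no error, wrong answer.
print((yhat != y_2d).shape, np.mean(yhat != y_2d))  # (4, 4) 0.5
```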

1 Like