C2_W3_Lab 1_Model Evaluation & Selection (Scikit)

Debatreyo_Roy · September 13, 2023, 12:24pm

[Context: sklearn.preprocessing.PolynomialFeatures Class]

In this lab, why did we use the transform() method when creating a single model,
but the fit_transform() method when creating multiple models,
i.e. NOT using the polynomial rules learnt from the training data?

(Assuming that mentors have access to the labs and the corresponding code, I did not copy the entire code cell.)

rmwkwok · September 13, 2023, 12:34pm

Hello @Debatreyo_Roy,

It is a good question. Probably the most intuitive approach would be to fit with the training data, then use the fitted PolynomialFeatures object to transform the cv data. I can assure you that this approach is correct and is always the one to keep in mind.

However, for the case that you have mentioned, the only reason that ALLOWS us to do it differently is that PolynomialFeatures actually learns nothing from the training data. Nothing. Yes - the name fit suggests that something is to be learnt, but think carefully: if we set degree=2, do we really need to learn anything from the training data in order to perform a square on each of the features? No. If you set degree=2, that information is by itself enough for PolynomialFeatures to get the job done: it squares each of the features AND it multiplies every possible pair of features together. Therefore, it does not actually learn anything, and that’s why, whether or not we have fitted it to the training set before processing the cv set, we are going to get the same transformed cv set at the end. You can verify that yourself. Try it.
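One way to run that check is sketched below. The arrays are random stand-ins, not the lab's data, and the names x_train, x_cv, poly_a and poly_b are made up for this illustration; include_bias=False is also just a choice for the example.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Stand-in data (assumption): two features, random values, not the lab's dataset.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(6, 2))
x_cv = rng.normal(size=(4, 2))

# Approach A: fit on the training set, then transform the cv set.
poly_a = PolynomialFeatures(degree=2, include_bias=False)
poly_a.fit(x_train)
cv_a = poly_a.transform(x_cv)

# Approach B: fit_transform directly on the cv set.
poly_b = PolynomialFeatures(degree=2, include_bias=False)
cv_b = poly_b.fit_transform(x_cv)

# Both give the same columns: x1, x2, x1^2, x1*x2, x2^2.
same = np.allclose(cv_a, cv_b)
print(cv_a.shape, same)
```

This only holds for transformers like PolynomialFeatures whose output does not depend on statistics of the fitted data; a scaler fitted on the cv set would generally give different numbers.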

Cheers,
Raymond

rmwkwok · September 13, 2023, 1:08pm

On second thought, I missed a necessary condition for being allowed to do fit_transform on the cv set directly: we need to be sure that the training set and the cv set have the same structure in terms of features.

It seems that PolynomialFeatures is able to remember the number of features and their names when it is being fit to a training set, and on transform, it seems to (verification needed) perform a sanity check on the incoming dataset.

This means that, if we fit it to a dataset of 4 features, then mistakenly transform a dataset of (somehow) 5 features, then it will raise an exception.

What I am saying is, if there is anything that PolynomialFeatures is learning from the training set, that would be the number of features and feature names, and they are for the purpose of sanity checks.
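The sanity check described above can be tried with a small sketch like the one below. It uses all-zero placeholder arrays (an assumption for illustration) and relies on the fitted attribute n_features_in_ and on transform raising a ValueError when the column count differs, which is the behaviour in recent scikit-learn releases; older versions may differ.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)

# Fit on a placeholder dataset with 4 features.
poly.fit(np.zeros((3, 4)))
n_seen = poly.n_features_in_  # number of features remembered from fit
print(n_seen)

# Transforming a dataset with 5 features should fail the sanity check.
try:
    poly.transform(np.zeros((3, 5)))
    raised = False
except ValueError as err:
    raised = True
    print(err)
```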

As far as the lab is concerned, we should not have any problem since the training set and the cv set have the same feature structure.

Anyway, even though it is okay for us to do fit_transform on the cv set for PolynomialFeatures there, I recommend that you stick with doing fit on the training set and then transform on the cv set.

rmwkwok · September 13, 2023, 1:11pm

I am considering filing a ticket for the course team to change that, but it is important that you know the rationale.

Debatreyo_Roy · September 16, 2023, 11:02am

PolynomialFeatures object —>

What exactly does it learn from the original data?
My understanding is that it learns the input features and, based on the degree given as an
argument, simply creates a polynomial of that degree from the learnt input features.
Since the original data (and its input features) is not changing, creating each new model with
a different degree does not require learning anything new; it simply keeps combining the learnt input features
to make a polynomial of that degree.

Is this understanding correct?

Debatreyo_Roy · September 16, 2023, 11:06am

Thank you. All these detailed explanations and reasonings you keep providing (for each query that I have posted till now) help a lot.

Also, if you can point to some resource/course for learning the Scikit library along with exercises/projects to practice it simultaneously, that will be helpful.

rmwkwok · September 16, 2023, 12:18pm

Hello @Debatreyo_Roy, I believe your “learns the input features” means my “remember the number of features and (if provided) the feature names”, and if I am right, then I agree.

As long as both the training set and the cv set have the same set of features, ordered in the same way, it’s fine to fit_transform the cv set and the training set separately. However, my recommendation will always be to transform the cv set with the object fitted to the training set, because it is the better practice.
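A minimal sketch of that recommended pattern when building several models of different degrees might look like this. The data are random stand-ins and the names x_train, x_cv and n_columns are assumptions for illustration, not the lab's code.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Stand-in data (assumption): two input features.
rng = np.random.default_rng(1)
x_train = rng.normal(size=(8, 2))
x_cv = rng.normal(size=(5, 2))

n_columns = {}
for degree in range(1, 4):
    poly = PolynomialFeatures(degree=degree, include_bias=False)

    # Fit on the training set only ...
    x_train_mapped = poly.fit_transform(x_train)
    # ... and reuse the fitted object for the cv set.
    x_cv_mapped = poly.transform(x_cv)

    # Both sets end up with the same number of polynomial columns.
    assert x_train_mapped.shape[1] == x_cv_mapped.shape[1]
    n_columns[degree] = x_cv_mapped.shape[1]

print(n_columns)
```

A new PolynomialFeatures object is created per degree because the degree is fixed at construction; the same fit-then-transform habit carries over to transformers that do learn from data, such as scalers.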

I think a good project is to just work on any dataset you find on the internet and use sklearn whenever possible.

You might wonder how you would know when sklearn can be used, simply because you are not familiar with it.

The answer is simple: go through MLS courses 1 and 2, write down all the skills that you can apply, order them in sequence based on the roles that you have learnt, then for each skill, google sklearn <skill name>, and you will see whether sklearn can help you with that. After finishing the project, share a presentable notebook and ask for comments on where (and how) you can make further (and better) use of sklearn.

Obviously my suggestion above is not a dedicated course, and if such a course exists, I am sure you can find it on Google. Instead, it is a completely doable process that requires you to review the lectures and do a lot of thinking, googling, reading, and trials, and you need to organize things yourself. Compared to a course prepared by someone else, it is going to take you more time, but your effort will pay off, and if you share presentable notebooks for the community to review, then you are not alone.

Raymond

rmwkwok · September 16, 2023, 12:25pm

A dedicated course saves time and gives a straight path, but going through my process gives you the ups and downs, which can let the skills sink in better and can broaden one’s horizons.

Christina_Fan · April 3, 2024, 2:19am

Hi Raymond,
I must admit, this lab is hard. But I can always learn something new by reading through your responses to other learners’ questions.
I have a different question regarding classification in this lab: is the reason we need to convert y into 2D here matrix multiplication? Isn’t that about the shapes of W and X, so what does it have to do with the shape of y? And why is the output shape of y still (200,1)?

Many thanks
Christina

rmwkwok · April 3, 2024, 7:42am

Hello Christina @Christina_Fan,

Yes, you are right that it has nothing to do with X, so I think we can ignore and focus only on

Since it said “will require it”, I think the easiest way is to search y_bc and any other variable that is derived from it, and see what actually requires a 2D y.

The search is easy with Chrome’s Ctrl+F, and maybe this requires it → , but I trust you know how to verify it and find other cases?

Cheers,
Raymond

Christina_Fan · April 4, 2024, 3:55am

Thank you for the hints, Raymond. I can see that both yhat and y_bc_train are 1D arrays on which the np.mean() calculation can be performed. So I’m still unsure why it required a 2D y in the first place?

rmwkwok · April 4, 2024, 11:52pm

Nope, Christina @Christina_Fan, check the following screenshot out:

I need you to do checks like this

You may also comment out the code line that converts y into 2D and see if there would be any error. I mean, these “print-check” and “screwing-up-the-code” exercises are the best ways to come up with your own ideas.

Cheers,
Raymond

Christina_Fan · April 6, 2024, 1:11am

Thank you for being so patient with me. So interesting: I also did those print-checks, even before and after converting y_bc, but I literally did not pick up that (200,1) is a 2D array. Now I get it.

rmwkwok · April 6, 2024, 4:34am

Cool! So you had tried both! Then you had seen that, without converting y_bc, yhat != y_bc_cv could run through without error, but the result was different.
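The effect can be reproduced with a tiny sketch, assuming (as the thread suggests) that the predictions yhat come out as a 2D column while the unconverted labels are 1D. The five-element arrays below are made-up stand-ins, not the lab's y_bc_cv.

```python
import numpy as np

# Stand-in predictions as a 2D column, shape (5, 1).
yhat = np.array([[0], [1], [0], [0], [1]])

# Stand-in labels: 1D version, shape (5,), and 2D version, shape (5, 1).
y_1d = np.array([0, 1, 1, 0, 1])
y_2d = y_1d.reshape(-1, 1)

# Shapes match: an elementwise comparison, one entry per example.
mismatch_ok = yhat != y_2d
err_ok = np.mean(mismatch_ok)

# Shapes differ: NumPy broadcasts (5, 1) against (5,) into a (5, 5) grid,
# comparing every prediction with every label. No error is raised.
mismatch_broadcast = yhat != y_1d
err_broadcast = np.mean(mismatch_broadcast)

print(mismatch_ok.shape, err_ok)
print(mismatch_broadcast.shape, err_broadcast)
```

Printing the shape of the comparison result, not just the final number, is the quickest way to spot this kind of silent broadcasting.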
