CW W2 Lab 4: Creating feature vs changing model

ybakos · May 13, 2023, 8:10pm

At the end of Week 2, we consider polynomial regression. In Lab 4, I assumed that our model would change from a linear one (y = mx + b) to a polynomial one (y = x^2 + 1). I was surprised that, instead, we created a new feature representing x^2, and used the same linear model.

Why did we create a new feature representing x^2, rather than using a non-linear model? Would the two different approaches (creating a feature for x^2, versus using a non-linear model of y=w*x^2 + 1) give us the same results? What would the benefits and drawbacks to the two approaches be?

TMosh · May 13, 2023, 8:24pm

We’re making a more complex data set, by including new features that are non-linear combinations of the original features.

We can still use a model based on their linear combinations (the sum of each feature multiplied by a weight). This means we only really have one cost function, regardless of the complexity of the data set or the number of terms in the summation.

ybakos · May 13, 2023, 8:48pm

Thank you for the response. I find this approach pretty versatile.

But why have you brought up the cost function? If my model is, say, w*x^3, isn’t my cost function still the same no matter what the model function is? If my cost function is mean squared error, I still compute the average of (f(x) - y)^2, right? Just because f(x) is now w*x^3 instead of wx + b shouldn’t change the cost function… it is still mean squared error, right?
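To make the point concrete, here is a minimal sketch in plain Python. The names `mse`, `linear`, and `cubic` and the toy data are made up for illustration and are not from the lab, and the cost is a plain mean of squared errors (the course may scale it differently). The cost function takes the prediction function as an argument and does not care what form it has:

```python
def mse(predict, xs, ys):
    """Mean squared error of any prediction function over a data set."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data that follows y = x^3 exactly (illustrative values).
xs = [1.0, 2.0, 3.0]
ys = [1.0, 8.0, 27.0]

linear = lambda x: 2.0 * x + 1.0   # f(x) = wx + b with w = 2, b = 1
cubic = lambda x: 1.0 * x ** 3     # f(x) = w*x^3 with w = 1

# The same cost function scores both models.
cost_linear = mse(linear, xs, ys)
cost_cubic = mse(cubic, xs, ys)
```

Only the prediction function changes between the two calls; the cost computation is identical.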

Perhaps I can rephrase my question. Can I use gradient descent and “polynomial regression” to fit a “curve” in the data set by leaving my input, X, alone, and using a quadratic function as the model, instead of a linear function?

If not, why not? If so, then why not do this instead of creating features and using a linear model?

TMosh · May 13, 2023, 9:35pm

Yes, it is. In my opinion, there really isn’t anything called “polynomial regression”. It’s better to think of it as linear regression using polynomial features.

ybakos · May 13, 2023, 10:02pm

Got it, thanks.

So then, theoretically speaking, would a regression algorithm (cost, gradient descent) using a non-linear function (e.g. a quadratic, log, sqrt, trig) still “converge” if the data set could be fit by the function?

I’ll try some digging online and work some examples out by hand.

TMosh · May 14, 2023, 1:11am

What are you referring to when you say “using a non-linear function”? The cost equation, the predictions, or something else?

ybakos · May 14, 2023, 1:31am

Prediction. I understand the “model” within linear regression to be a function, f(x) = wx + b. I’m curious if a ‘non linear regression’ model can make predictions with a function like f(x) = wx^2 + b.

TMosh · May 14, 2023, 2:07am

Yes. You do it by squaring x first and including it in the training set.

Don’t think of ‘x’ as a variable. Those are all the features in the training set. They’re just real values.

TMosh · May 14, 2023, 2:08am

So if your training set just had one feature ‘x’, then you might add features of x^2, x^3, etc.
Then when you include them in the training set, each example will have features x, x^2, and x^3. Each feature has a weight, so you’d have three ‘w’ values.

Then you train and learn the weights.
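As a rough sketch of what that looks like in plain Python (the helper names `add_polynomial_features` and `predict`, and the weight values, are hypothetical and chosen only for illustration; they are not learned and not from the lab):

```python
def add_polynomial_features(x, degree=3):
    """Expand one raw value x into the feature list [x, x^2, ..., x^degree]."""
    return [x ** p for p in range(1, degree + 1)]

def predict(features, w, b):
    """Ordinary linear model: weighted sum of the features plus a bias."""
    return sum(wi * fi for wi, fi in zip(w, features)) + b

raw_x = [1.0, 2.0, 3.0]
X = [add_polynomial_features(x) for x in raw_x]   # one row per example

# Three features per example, so three weights (illustrative values).
w = [0.5, -1.0, 2.0]
b = 1.0
h = [predict(row, w, b) for row in X]
```

The model itself never squares or cubes anything; it only sees three columns of real numbers.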

ybakos · May 18, 2023, 11:22pm

Thank you for the additional responses. I am not sure if my question is being understood correctly. I do understand that I can create a new feature from another one, computed as the square of the original, for each sample in the data set. I know that x is not a variable, but a feature. I know that x^2 would be a new synthesized feature.

My question is, if I have:

Feature x and feature x^2
A linear model that uses the function y = wx + b
And I provide the model the feature values x^2, as the x input of the linear function

Is this equivalent to:

Feature x
A nonlinear model that uses the function y = wx^2 + b
And I provide the model the feature values x, as the x input of the nonlinear function

Would I get the same result?

If so, why don’t we create models that use non-linear functions in their implementation, rather than computing synthesized features (e.g. x^2) and sticking with the usual linear function y = wx + b? Is it for flexibility and extensibility? (Is it just easier or more flexible to compute x^2 or whatever we want, rather than always changing the model’s function?)

If not, why not?

TMosh · May 19, 2023, 12:12am

Just to be clear, when you add features, ‘x’ is no longer a scalar. It’s a vector, and ‘w’ is a vector. So w*x is a dot product for each example.
x[0] will be the original ‘x’ value.
x[1] will be the new x^2 value.
Using this method allows for both a linear and non-linear characteristic (such as the shape being a parabola that is offset in either axis).

Also, to be clear, the model output doesn’t equal ‘y’, because ‘y’ holds the labels. The output is a prediction, which I’ll call ‘h’ here (for the hypothesis).

So if you write out the equation after adding the squared feature, you have for each example:
h = x[0]*w[0] + x[1]*w[1] + b

Regarding your last question - yes, you could use any non-linear function you like to create new features. You’d have to try a lot of them to get one that works well for a specific set of data.

But polynomials have some advantages:

Polynomials can approximate any continuous real function on a bounded interval as closely as you like.
Polynomials are very easy to compute. Other non-linear functions like logs, square roots and trig functions are not very easy.

rmwkwok · May 19, 2023, 12:24am

I think the difference is whether we pre-compute the non-linear features once and for all, or compute them from the raw x at every round of gradient descent. If we run 10000 rounds of gradient descent, we either compute each non-linear feature once up front, or compute it 10000 times, once per round.

Obviously, pre-computing it saves time.
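Here is a small sketch of the two options side by side, in plain Python. The function names, the toy data, and the learning rate and iteration count are my own assumptions for illustration. Both versions run the same gradient descent on f = w*x^2 + b; the only difference is where the squaring happens:

```python
def fit_precomputed(xs, ys, alpha=0.05, iters=2000):
    """Linear model on a feature z = x^2 that is computed once, up front."""
    zs = [x ** 2 for x in xs]
    w, b, m = 0.0, 0.0, len(xs)
    for _ in range(iters):
        errs = [w * z + b - y for z, y in zip(zs, ys)]
        dw = sum(e * z for e, z in zip(errs, zs)) / m
        db = sum(errs) / m
        w, b = w - alpha * dw, b - alpha * db
    return w, b

def fit_in_model(xs, ys, alpha=0.05, iters=2000):
    """Model f(x) = w*x^2 + b that squares the raw x in every iteration."""
    w, b, m = 0.0, 0.0, len(xs)
    for _ in range(iters):
        errs = [w * x ** 2 + b - y for x, y in zip(xs, ys)]
        dw = sum(e * x ** 2 for e, x in zip(errs, xs)) / m
        db = sum(errs) / m
        w, b = w - alpha * dw, b - alpha * db
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x ** 2 + 1.0 for x in xs]   # toy data from y = 2x^2 + 1

w_pre, b_pre = fit_precomputed(xs, ys)
w_mod, b_mod = fit_in_model(xs, ys)
```

Both calls should end up with the same `w` and `b` (close to 2 and 1 on this toy data); the second one just repeats the `x ** 2` work in every round.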

BTW, I think most packages assume you have pre-computed the features, since they always take the linear-combination approach. I have not seen any popular package that lets you program the non-linear function into the model itself, at least not recently. I last did that more than 13 years ago…

Cheers,
Raymond

ybakos · May 19, 2023, 10:41pm

Thanks @rmwkwok . The pre-computing advantage is one I hadn’t thought of.

rmwkwok · May 19, 2023, 11:05pm

You are welcome. Writing x^2 in the model form instead of a pre-computed feature x_2 has no other effect on the training result. I think the most critical consideration is how it affects the weight update formula, but since the prediction is always linear in w_2, whether its input is written as x_2 or x^2, the only difference is the time it takes to run.
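To spell that out, here is the gradient with respect to w_2, assuming the squared error cost with a 1/(2m) factor and a model with weights w_1, w_2 and bias b (the symbol names are my own choice for this sketch):

```latex
J(\mathbf{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f(x^{(i)}) - y^{(i)} \right)^2,
\qquad f(x) = w_1 x + w_2 x_2 + b

\frac{\partial J}{\partial w_2}
  = \frac{1}{m} \sum_{i=1}^{m} \left( f(x^{(i)}) - y^{(i)} \right) x_2^{(i)}

\text{With } x_2^{(i)} = \left( x^{(i)} \right)^2 : \quad
\frac{\partial J}{\partial w_2}
  = \frac{1}{m} \sum_{i=1}^{m} \left( f(x^{(i)}) - y^{(i)} \right) \left( x^{(i)} \right)^2
```

The update for w_2 multiplies the error by the same number either way, so the learned weights match.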

Cheers,
Raymond

ybakos · May 20, 2023, 1:23am

I believe I underestimated how complex a polynomial might get, and the advantages of pre-computing make sense.
