CW W2 Lab 4: Creating feature vs changing model

At the end of Week 2, we consider polynomial regression. In Lab 4, I assumed that our model would change from a linear one (y = mx + b) to a polynomial one (y = x^2 + 1). I was surprised that, instead, we created a new feature representing x^2, and used the same linear model.

Why did we create a new feature representing x^2, rather than using a non-linear model? Would the two different approaches (creating a feature for x^2, versus using a non-linear model of y=w*x^2 + 1) give us the same results? What would the benefits and drawbacks to the two approaches be?

We’re making a more complex data set, by including new features that are non-linear combinations of the original features.

We can still use a model based on their linear combinations (the sum of each feature multiplied by a weight). This means we only really have one cost function, regardless of the complexity of the data set or the number of terms in the summation.
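To make the "one cost function" point concrete, here is a minimal sketch in NumPy. The helper name compute_cost and the toy numbers are assumptions for illustration, not the lab's code, and I use a plain mean of squared errors (some courses add a factor of 1/2). The same function handles one feature or two, because it only ever sees a matrix of real values.

```python
import numpy as np

def compute_cost(X, y, w, b):
    """Mean squared error of the linear model X @ w + b (hypothetical helper)."""
    predictions = X @ w + b          # one dot product per example
    return np.mean((predictions - y) ** 2)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = x ** 2 + 1                       # targets taken from the curve y = x^2 + 1

# Original data set: a single feature column, x.
X_linear = x.reshape(-1, 1)
cost_linear = compute_cost(X_linear, y, w=np.array([3.0]), b=0.0)

# Engineered data set: columns x and x^2. Same cost function, one more weight.
X_poly = np.column_stack([x, x ** 2])
cost_poly = compute_cost(X_poly, y, w=np.array([0.0, 1.0]), b=1.0)

print(cost_linear)  # 1.0: the line 3x misses every point by 1
print(cost_poly)    # 0.0: weights [0, 1] and b = 1 reproduce y = x^2 + 1
```

Nothing in compute_cost knows that the second column happens to be the square of the first.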

Thank you for the response. I find this approach pretty versatile.

But why have you brought up the cost function? If my model is, say, w*x^3, isn’t my cost function still the same no matter what the model function is? If my cost function is mean squared error, I still compute the average of (f(x) - y)^2, right? Just because f(x) is now w*x^3 instead of wx + b shouldn’t change the cost function… it is still mean squared error, right?

Perhaps I can rephrase my question. Can I use gradient descent and “polynomial regression” to fit a “curve” in the data set by leaving my input, X, alone, and using a quadratic function as the model, instead of a linear function?

If not, why not? If so, then why not do this instead of creating features and using a linear model?

Yes, it is. In my opinion, there really isn’t anything called “polynomial regression”. It’s better to think of it as linear regression using polynomial features.

Got it, thanks.

So then, theoretically speaking, would a regression algorithm (cost, gradient descent) using a non-linear function (eg a quadratic, log, sqrt, trig) still “converge” if the data set could be fit by the function?

I’ll try some digging online and work some examples out by hand.

What are you referring to when you say “using a non-linear function”? The cost equation, the predictions, or something else?

Prediction. I understand the “model” within linear regression to be a function, f(x) = wx + b. I’m curious if a ‘non linear regression’ model can make predictions with a function like f(x) = wx^2 + b.

Yes. You do it by squaring x first and including it in the training set.

Don’t think of ‘x’ as a variable. Those are all the features in the training set. They’re just real values.

So if your training set just had one feature ‘x’, then you might add features of x^2, x^3, etc.
Then when you include them in the training set, each example will have features x, x^2, and x^3. Each feature has a weight, so you’d have three ‘w’ values.

Then you train and learn the weights.
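Here is a rough sketch of those steps, assuming NumPy, made-up noiseless data, and a plain batch gradient descent loop. The function name gradient_descent, the learning rate, and the iteration count are my own choices for illustration, not values from the lab.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.5, iterations=5000):
    """Batch gradient descent for the linear model X @ w + b (hypothetical helper)."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(iterations):
        error = X @ w + b - y            # prediction minus label, per example
        w -= alpha * (X.T @ error) / m   # gradient of (1/2m) * sum(error^2) w.r.t. w
        b -= alpha * error.mean()        # same cost, gradient w.r.t. b
    return w, b

# One original feature, kept in [-1, 1] so no feature scaling is needed here.
x = np.linspace(-1.0, 1.0, 21)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + 0.5 * x ** 3   # made-up target curve

# Add the engineered features: each example now has x, x^2 and x^3.
X = np.column_stack([x, x ** 2, x ** 3])

w, b = gradient_descent(X, y)
print(w, b)   # approaches [2, -3, 0.5] and 1
```

With raw inputs on a larger scale, the x^3 column would dwarf the others, and you would probably want to scale the features before training.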

Thank you for the additional responses. I am not sure if my question is being understood correctly. I do understand that I can create a new feature from another one, computed as the square of the original, for each sample in the data set. I know that x is not a variable, but a feature. I know that x^2 would be a new synthesized feature.

My question is, if I have:

  1. Feature x and feature x^2
  2. A linear model that uses the function y = wx + b
  3. And I provide the model the feature values x^2, as the x input of the linear function

Is this equivalent to:

  1. Feature x
  2. A nonlinear model that uses the function y = wx^2 + b
  3. And I provide the model the feature values x, as the x input of the nonlinear function

Would I get the same result?

If so, why don’t we create models that use non-linear functions in their implementation, rather than computing synthesized features (eg x^2) and sticking with the usual linear function y = wx + b? Is it for flexibility and extensibility? (Is it just easier or more flexible to compute x^2 or whatever we want, rather than always changing the model’s function?)

If not, why not?

Just to be clear, when you add features, ‘x’ is no longer a scalar. It’s a vector, and ‘w’ is a vector. So w*x is a dot product for each example.
x[0] will be the original ‘x’ value.
x[1] will be the new x^2 value.
Using this method allows for both a linear and non-linear characteristic (such as the shape being a parabola that is offset in either axis).

Also, to be clear, the result doesn’t equal ‘y’, because those are the labels. It equals a prediction, which I’ll call ‘h’ here (for the hypothesis).

So if you write out the equation after adding the squared feature, you have for each example:
h = x[0]*w[0] + x[1]*w[1] + b
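Written with NumPy, that prediction is a single dot product per example. The sketch below uses made-up weights (w = [-4, 1], b = 7), chosen by me so that h = x^2 - 4x + 7 = (x - 2)^2 + 3, which is a parabola shifted along both axes. The function name predict is hypothetical.

```python
import numpy as np

def predict(x_row, w, b):
    """Prediction h for one example whose features are [x, x^2]."""
    return np.dot(x_row, w) + b

w = np.array([-4.0, 1.0])   # w[0] multiplies x, w[1] multiplies x^2 (made-up values)
b = 7.0

raw = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([raw, raw ** 2])   # x[0] = original value, x[1] = its square

h = np.array([predict(row, w, b) for row in X])
print(h)   # [7. 4. 3. 4. 7.]
```

The lowest prediction is 3, at raw x = 2, so the vertex sits away from the origin even though the model is still a weighted sum of features plus b.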

Regarding your last question - yes, you could use any non-linear function you like to create new features. You’d have to try a lot of them to get one that works well for a specific set of data.

But polynomials have some advantages:

  • Polynomial series can approximate a very wide range of real functions (any continuous function on a closed interval, to arbitrary accuracy).
  • Polynomials are very easy to compute. Other non-linear functions like logs, square roots and trig functions are more expensive to compute.

I think the difference is whether we want to pre-compute the non-linear features once and for all, or compute them from the raw x at every round of gradient descent. If we run 10000 rounds of gradient descent, the choice is between computing each non-linear feature once and computing it 10000 times, once per round.

Obviously, pre-computing it saves time.
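A small sketch of that trade-off, assuming NumPy and a toy one-weight model h = w*x^2 + b. The helper names (train_precomputed, train_recomputed) and the squarings counter are hypothetical, added only to count how often the squaring is done.

```python
import numpy as np

def train_precomputed(x, y, alpha, iterations):
    """Square x once, then run gradient descent on the stored feature."""
    z = x ** 2                      # non-linear feature computed a single time
    squarings = 1
    w, b = 0.0, 0.0
    for _ in range(iterations):
        error = w * z + b - y
        w, b = w - alpha * np.mean(error * z), b - alpha * np.mean(error)
    return w, b, squarings

def train_recomputed(x, y, alpha, iterations):
    """Recompute x^2 from the raw x inside every round of gradient descent."""
    squarings = 0
    w, b = 0.0, 0.0
    for _ in range(iterations):
        z = x ** 2                  # same values as before, recomputed each round
        squarings += 1
        error = w * z + b - y
        w, b = w - alpha * np.mean(error * z), b - alpha * np.mean(error)
    return w, b, squarings

x = np.linspace(-1.0, 1.0, 11)
y = 3.0 * x ** 2 + 1.0              # made-up data

w1, b1, n1 = train_precomputed(x, y, alpha=0.5, iterations=10000)
w2, b2, n2 = train_recomputed(x, y, alpha=0.5, iterations=10000)
print(n1, n2)                       # 1 10000
print(w1 == w2 and b1 == b2)        # True: identical arithmetic, identical result
```

Squaring is cheap, so the gap only becomes noticeable with many features, many examples, or more expensive functions.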

BTW, I think most packages assume you have pre-computed the features, since they always go the linear-combination way. I have not seen any popular package that lets you program the model form yourself, at least not recently. I last did that more than 13 years ago…


Thanks @rmwkwok . The pre-computing advantage is one I hadn’t thought of.

You are welcome. Changing from x_2 to x^2 in the model form has no other effect on the training result. I think the most critical point to consider is how it affects the weight update formula, but since the model output y is always linear in w_2, whether the feature is written as x_2 or x^2, it has no effect except on the running time.
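One way to check this numerically, as a sketch under my own assumptions (NumPy, made-up data, a cost of (1/2m) times the sum of squared errors, and hypothetical helper names): differentiate the cost of the "non-linear model" w*x^2 + b by finite differences, then compare with the usual linear-regression gradient formula applied to the feature z = x^2.

```python
import numpy as np

def cost_nonlinear_model(w, b, x, y):
    """Cost when the model itself squares the raw input: f(x) = w * x^2 + b."""
    return np.mean((w * x ** 2 + b - y) ** 2) / 2.0

def linear_gradient(w, b, z, y):
    """Standard linear-regression gradient, with z treated as an ordinary feature."""
    error = w * z + b - y
    return np.mean(error * z), np.mean(error)

x = np.array([-2.0, -1.0, 0.5, 1.0, 3.0])
y = np.array([5.0, 1.5, 0.0, 2.0, 10.0])   # made-up labels
w, b = 0.7, -0.3                            # arbitrary point at which to compare

# Central finite differences on the non-linear model's cost.
eps = 1e-6
dw_numeric = (cost_nonlinear_model(w + eps, b, x, y)
              - cost_nonlinear_model(w - eps, b, x, y)) / (2 * eps)
db_numeric = (cost_nonlinear_model(w, b + eps, x, y)
              - cost_nonlinear_model(w, b - eps, x, y)) / (2 * eps)

# Linear-regression formula on the pre-computed feature z = x^2.
dw_formula, db_formula = linear_gradient(w, b, x ** 2, y)

print(np.isclose(dw_numeric, dw_formula), np.isclose(db_numeric, db_formula))  # True True
```

This only holds because the weight multiplies the transformed input. If a parameter sat inside the non-linear function, say sin(w*x), the update formula would change.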


I believe I underestimated how complex a polynomial might get, and the advantages of pre-computing make sense.