Feature Engineering - please help understand this

In the Optional lab - Feature Engineering and Polynomial Regression

We created a new feature, x^2 as follows:

import numpy as np

# create target data
x = np.arange(0, 20, 1)
y = 1 + x**2

# Engineer features
X = x**2  #<-- added engineered feature

And then the gradient descent function was called with X:

model_w,model_b = run_gradient_descent_feng(X, y, iterations=10000, alpha = 1e-5)

I can understand this, as now instead of x, we have x^2

However, we go on to add more engineered features, like x^3:

# create target data
x = np.arange(0, 20, 1)
y = x**2

# engineer features
X = np.c_[x, x**2, x**3]  #<-- added engineered features

and call the gradient descent with X:

model_w,model_b = run_gradient_descent_feng(X, y, iterations=10000, alpha=1e-7)

I have two questions:

  1. Why is the model function given as y = x**2?
     Should it not be y = x + x**2 + x**3 + b?

  2. I don’t understand how X can be an array with 3 columns and still be passed to the function that computes the y estimate, when y itself is defined with only x**2.

Any help appreciated.

Hello @amitontheweb,

Let’s look at the lab from a different angle.

First, below is a feature that we can observe.

x = np.arange(0, 20, 1)

Second, below is the true label generation process. In a real-world application we won’t know this, but in this lab the true process is revealed to us.

y = 1 + x**2

Third, we find that x alone is not good enough to model y. This is clear whether or not we pretend to know the true label generation process. If we pretend we don’t know, we fit the model, compute the loss, and find it too high. If we don’t pretend, then we already know that a model linear in x cannot reproduce x**2.
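For example, here is a quick check (a minimal sketch using NumPy's least-squares polyfit in place of the lab's gradient descent routine):

import numpy as np

x = np.arange(0, 20, 1)
y = 1 + x**2

# best straight line y ≈ w*x + b using only the raw feature x
w, b = np.polyfit(x, y, deg=1)
mse = np.mean((w * x + b - y) ** 2)
print(w, b, mse)  # the error stays large: no straight line can track x**2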

Then we start to engineer some features. Since we pretend we don’t know the true process, the most straightforward move is to create x**2, x**3, and so on, and that is exactly what this optional lab does.

Now that we have engineered 2 additional features, below is the code that stacks the original feature and the new ones together to form our final dataset - having 1+2=3 columns.

X = np.c_[x, x**2, x**3]
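This also touches on your second question: np.c_ places the three column vectors side by side, so X becomes a (20, 3) matrix while y remains a (20,) vector - each row of X holds the three feature values for one example. A quick check:

import numpy as np

x = np.arange(0, 20, 1)
X = np.c_[x, x**2, x**3]

print(X.shape)  # (20, 3): 20 examples, 3 features
print(X[:3])    # first rows are [x, x**2, x**3] per example
# [[0 0 0]
#  [1 1 1]
#  [2 4 8]]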

Then we fit the model to the data X, and we find that the weight for x**2 is the largest. If we continue to pretend we don’t know the true label generation process, we can conclude that, given the reduced loss, the engineered features are making an important contribution. If we don’t pretend, then the result makes complete sense: x**2 is the truth, and no other engineered feature should be better than it.
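To see why, here is a minimal sketch that uses a closed-form least-squares fit as a stand-in for run_gradient_descent_feng (the exact numbers from the lab's gradient descent will differ, but the pattern is the same):

import numpy as np

x = np.arange(0, 20, 1)
y = x**2
X = np.c_[x, x**2, x**3]

A = np.c_[X, np.ones_like(x)]  # append a column of 1s for the intercept b
w1, w2, w3, b = np.linalg.lstsq(A, y, rcond=None)[0]
print(w1, w2, w3, b)  # w2 ≈ 1 and the rest ≈ 0: the x**2 column carries the signal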

This is an exercise for us to see how feature engineering can improve model performance. With real-world data we won’t know the true data generation process, but in this lab it is revealed to us so that we can more easily understand the exercise’s results - e.g. what to expect, and what not to expect.

Cheers,
Raymond

Thanks Raymond. I also did some more research on this and went through the lab notes and the lectures. I am now pretty clear about doing feature engineering and trying out new features to see which ones carry more weight, as you mentioned.

A few other things helped me as well. We are only changing the input data, so we are still doing linear regression here, just with transformed x values like x**2.

Likewise, the graph will still plot y against x as usual, but with x**2 used to compute the estimated y values.

I assume that since we now have several features per example, the code uses a dot product of each feature row with the weight vector to get the value of y.
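For example, something like this sketch (model_w and model_b here are made-up stand-ins for the fitted parameters):

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 20, 1)
y = x**2
X = np.c_[x, x**2, x**3]

model_w = np.array([0.0, 1.0, 0.0])  # hypothetical fitted weights
model_b = 0.0                        # hypothetical fitted intercept

y_hat = X @ model_w + model_b        # one dot product per row of X

plt.plot(x, y, label="actual")       # plotted against the original x
plt.plot(x, y_hat, label="predicted")
plt.legend()
plt.show()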

There is one mystery element - the gradient descent function used here, run_gradient_descent_feng.

It will be great if the function code could be shared.


Hello @amitontheweb, it’s great knowing that you are making progress!

You can find the lab’s supporting functions by first locating where the function is imported from (check the import statements at the top of the notebook).

Then click “Open” > “File” to go to the file browser, where you can open the relevant .py file and check out the code inside.

[screenshot of the file browser steps described above]
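Meanwhile, here is a rough sketch of what a routine like run_gradient_descent_feng typically does. This is an assumption for illustration only - the lab’s actual implementation may differ:

import numpy as np

def run_gradient_descent_feng(X, y, iterations=1000, alpha=1e-5):
    # Assumed sketch of batch gradient descent on a squared-error cost;
    # not the lab's actual code.
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(iterations):
        err = X @ w + b - y           # prediction error for all m examples
        w -= alpha * (X.T @ err) / m  # gradient of the cost w.r.t. each weight
        b -= alpha * err.sum() / m    # gradient of the cost w.r.t. the intercept
    return w, b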

Raymond

Okay, thank you very much, I got the files.