Feature engineering for multiple features

I am trying to construct a learning algorithm for multiple regression using the Ames housing dataset.

I want to examine how implementing polynomial transformations on my feature values can affect model performance. However, I am not sure of the correct way to implement these transformations on multiple features so that I can then run gradient descent.

For example:

In the “C1_W2_Lab04_FeatEng_PolyReg_Soln” lab above, it shows how to implement a polynomial transformation for one set of feature values using “np.c_[]”. How is this done for a model with multiple features?

I’ve tried using the same “np.c_[]” method on three features for a small subset of the dataset (5 training examples) and then stacked these arrays together.

The problem I have now is that these transformed feature values are in a 3-D array (shape (5, 3, 3) in this case), and all of the algorithms we have learned so far expect the feature matrix to be 2-D with shape (m, n). Disregard that these values are not normalized; I’m just trying to get the correct data structure.
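Roughly, here is a sketch of what I did (values made up, and I am assuming np.stack for the final step; the columns per feature are x, x^2, x^3):

```python
import numpy as np

m = 5  # small subset: 5 training examples
x1, x2, x3 = np.arange(m), np.arange(m) + 10.0, np.arange(m) + 20.0

# np.c_ builds a (5, 3) array of [x, x^2, x^3] columns for each feature
f1 = np.c_[x1, x1**2, x1**3]
f2 = np.c_[x2, x2**2, x2**3]
f3 = np.c_[x3, x3**2, x3**3]

# stacking the three (5, 3) arrays adds a new axis, producing a 3-D array
X_3d = np.stack([f1, f2, f3], axis=1)
print(X_3d.shape)  # (5, 3, 3) -- not the 2-D (m, n) shape gradient descent expects
```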

I feel like I am missing something and I do not think this is the correct way of going about this. We have learned running gradient descent for multiple regression and implementing polynomial transformations for one set of feature values but how do you put both of these together to run gradient descent on multiple polynomially transformed features?

My only guess is to leave all the feature values the same and change the actual linear prediction function in the “compute_cost” and “compute_gradient” functions to a desired polynomial function? Or maybe still use the np.c_() method but run gradient descent on each feature separately? Not sure how this would work either. Sorry about the long question but wanted to give you sufficient context. Any insight would be much appreciated.

Hi @mwillson15

Great job! But you don’t need a 3-D matrix for your features. Instead, you should expand your feature matrix by generating polynomial terms for each feature and then concatenate them into a single 2-D matrix.

Next step, stack them horizontally to create a new feature matrix with shape (m, n’), where n’ includes the original and polynomial features. Then, you can apply gradient descent as usual (no need to modify the gradient descent algorithm itself).
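As a minimal sketch of that expansion (squares and cubes are assumed here as the polynomial terms; the variable names are made up):

```python
import numpy as np

# hypothetical (m=5, n=3) feature matrix
X = np.arange(15, dtype=float).reshape(5, 3)

# generate polynomial terms for each feature, then join everything
# along the existing column axis to get a single 2-D matrix
X_poly = np.concatenate([X, X**2, X**3], axis=1)
print(X_poly.shape)  # (5, 9): n' = 3 original + 6 polynomial columns
```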

Hope it helps! Feel free to ask if you need further assistance.

Andrew’s original Machine Learning course (the one that inaugurated Coursera itself) included an example using non-linear combinations of two features.

The method used there was to compute all polynomial combinations of the original two features up to some threshold degree. Each of these combinations is a new feature.

So for example, with two original features (call them h and k so I don’t have to deal with subscripts):
x = [h, k]

For degree of 3, you would have the following set of features for each example:

x_new = [ h,
k,
h^2,
h * k,
k^2,
h^3,
h^2 * k,
h * k^2,
k^3]

So the two original features in each example were converted into nine features.

To explore the entire space, you would train several models, one for each polynomial degree.

Be careful to not take this method too far, since the number of features expands rapidly with the polynomial degree.
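A hedged sketch of that expansion for two features, written in plain numpy (scikit-learn’s PolynomialFeatures does essentially the same thing; the helper name poly_terms here is made up):

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_terms(X, degree):
    """All monomials of the columns of X with total degree 1..degree."""
    cols = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(range(X.shape[1]), d):
            # multiply the chosen columns together, e.g. (0, 0, 1) -> h * h * k
            cols.append(np.prod(X[:, list(combo)], axis=1))
    return np.stack(cols, axis=1)

hk = np.array([[2.0, 3.0]])     # one example with h = 2, k = 3
print(poly_terms(hk, 3))        # nine features: h, k, h^2, hk, k^2, h^3, h^2*k, h*k^2, k^3
```

Note how the count grows: degree 3 on two features already gives nine columns, which is why this method should not be taken too far.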

Thank you for your insight Alireza! It is much appreciated. However, I’m not quite sure what you mean by concatenating the transformed features into a “single 2-D matrix”. Wouldn’t then stacking these 2-D matrices together create a 3-D matrix?

I tried to use concatenation in the way I think I am interpreting your method below:

I concatenated 1-D arrays into single 1-D arrays, where each concatenated array corresponds to one feature, and then stacked all of them horizontally. This method creates a 2-D matrix, but it does not have the specific dimensions (m, n’) you described in your initial response.

I then tried this alternative method and I think this is the matrix structure you outlined:

I used the same variables as in the first method but just stacked them all together horizontally. This method has what I believe to be the correct dimensions and values (m, n’), where m is the initial number of training examples and n’ includes the original and polynomial features. However, I did not use any concatenation, so I’m not sure if this is what you meant.

If this structure is correct, how will my learning algorithm be able to differentiate between each feature variable if they are all defined as independent columns? In other words, in the initial multiple linear regression model, it was clear which features were being optimized because each feature had one instance in the model (x) and one index (X_train[:, 0] where j = 0). However, in this framework, each feature has 3 instances in the model (x^3, x^2, x) and 3 indices (X_train[:, 0], X_train[:, 1], X_train[:, 2] where j = 0, …, 2). Does this matter, or will you still be able to examine how each feature contributes to the model performance?

Additionally, if this structure is incorrect, I hope you can clarify the correct structure. Regardless, will you be able to input the correct structure into the learning algorithm to optimize the model where all of these features are included? Your initial response seemed to indicate that you can, and that you can apply gradient descent as usual on this correct structure. However, the response from “TMosh” seemed to indicate that “to explore this entire space, you would train several models, one for each polynomial degree”.

We trained a linear regression model with multiple features using gradient descent just fine, but does implementing these polynomial transformations complicate descent, and if so, is training each polynomial feature independently better practice?

Thank you TMosh, I appreciate it!

So in this case, to explore the entire space, the best practice would be to train nine different models for each feature?

I’m trying to optimize a multiple regression model using gradient descent to predict housing prices for a few features in the Ames housing dataset. I was going to construct three models and compare them using K-fold cross validation. Once the best model is chosen, I was going to further optimize the model using backwards elimination to eliminate any insignificant features. I was planning on using a multiple linear regression model and two polynomially transformed regression models to cross validate.

I wasn’t planning on getting crazy with the polynomial transformations, but I just want each model to include the same number of features for proper comparison. If the best practice when polynomially transforming multiple features in a model is to train each feature independently, how can I compare these models to the initial model which includes multiple variables? Is it a case where you just need to train the model features independently and, once done, you can go back and include all the features with their optimized parameters in one model? Not sure if that makes sense.

My guess is that, instead of transforming feature values to engineer linear relationships between features and the target so that a linear regression model fits better, comparing a multiple linear regression model with a couple of different regression models which are better suited to identifying non-linear trends might be a better method?

For example, cross-validating a linear regression model, random forest regression model, and a lightGBM regression model. This is an alternative method as I expect these latter models to use different cost functions and descent algorithms. I would like to stay within the scope of what has been taught so far but it’s not clear how to appropriately compare different models which include multiple features.

I am also thinking of just staying within a multiple linear regression framework and cross validating between linear models with different numbers of features. However, if these models exhibit non-linear trends, I don’t see how these models will be optimal and how feature engineering won’t be necessary.

No, I don’t think so. You can’t assume that the features are completely independent.

This may be a situation where a neural network can solve all this for you.

With a suitable selection of the size of the hidden layer, it can automatically create non-linear combinations of the input features so that the model gives the lowest cost fit.

In many situations a NN can entirely eliminate the need to do feature engineering.
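As an illustration of that idea (not code from the course — a minimal one-hidden-layer regression network in plain numpy, with the toy target, network size, and learning rate all made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))    # 2 raw features, no engineered columns
y = X[:, 0] * X[:, 1] + X[:, 0] ** 2     # a non-linear target the net must discover

hidden = 16
W1, b1 = rng.normal(0, 0.5, (2, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(0, 0.5, hidden), 0.0

alpha = 0.1
for _ in range(2000):
    H = np.maximum(0.0, X @ W1 + b1)     # ReLU hidden layer: learned non-linear combos
    y_hat = H @ W2 + b2
    err = y_hat - y
    # gradient descent on the squared-error cost
    dW2 = H.T @ err / len(y); db2 = err.mean()
    dH = np.outer(err, W2) * (H > 0)
    dW1 = X.T @ dH / len(y); db1 = dH.mean(axis=0)
    W2 -= alpha * dW2; b2 -= alpha * db2
    W1 -= alpha * dW1; b1 -= alpha * db1

mse = np.mean((np.maximum(0.0, X @ W1 + b1) @ W2 + b2 - y) ** 2)
print(mse)  # should end up well below np.var(y), with no hand-made polynomial features
```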

Okay thanks, I’ll look into it.

Hello, @mwillson15,

Can you verify it yourself? Actually, you can, because you can look at each number and see if they are organized in the way you expect. That is the learner’s own job. :wink: Now, is the structure correct?

Again, the key here is that you verify the numbers are organized correctly. As for “concatenate” and “stack”, if we know them well, we can use either of them to achieve the same goal.

import numpy as np

A = np.arange(6).reshape(2, 3)
B = np.arange(6).reshape(2, 3) + 10

print(
    # np.concatenate's doc says "Join a sequence of arrays along an existing axis."
    # How do you understand existing axis?
    np.concatenate([A, B], axis=1)
)

print(
    # np.stack's doc says "Join a sequence of arrays along a new axis."
    # How do you understand "new axis"?
    np.stack([A[:, 0], A[:, 1], A[:, 2], B[:, 0], B[:, 1], B[:, 2]], axis=1)
)

If you run my code above, you will see what happens. Again, we can verify the structure ourselves by visually checking the numbers in the resulting array.

Now, my question for you is: the docs say “concatenate” joins arrays along an existing axis, whereas “stack” joins along a new axis. How do you understand their difference (existing vs new)? If you expect the final array to be in the shape of (m, n'), which has TWO axes, how would you use stack and concatenate differently to reflect their different behaviors? How did I use them in my code above?

Your algorithm can’t read your python code to find out that the first three features have a power law relation. You give your algorithm 9 columns, they are 9 features. The linear regression model should have one trainable weight for each of these 9 features. Now, the question is, does the model have 9 trainable weights (plus 1 bias, perhaps)? How can you check the number of weights?

@mwillson15, we can discuss your answers to my questions if you like to. :wink:

Cheers,
Raymond

Hello @rmwkwok. Thanks for your input, I appreciate it!

Let me start by answering your questions about np.concatenate() and np.stack() to provide more context to my confusion with @Alireza_Saei’s response (although I very much appreciate his insight!) and to describe my improved understanding of these methods.

In my response to @Alireza_Saei, the context in which I described np.concatenate and np.stack was as if they are fundamentally different methods. However, with an improved understanding, I now see they provide similar functionality: joining a sequence of arrays. The key difference is that np.concatenate joins a sequence of arrays along an existing axis, and np.stack joins a sequence of arrays along a new axis.

My understanding of this difference is that np.concatenate joins a sequence of arrays along an existing axis, in that the input arrays must have the same shape except in the dimension corresponding to the axis parameter (default 0). The returned concatenated array has the same number of dimensions as the input arrays, with the values joined along the given (or default) existing axis.

In terms of np.stack, this joins a sequence of arrays along a new axis, where the input arrays must have the same shape and dimensions. This method returns a stacked array which has one more dimension than the input arrays. In np.stack, the axis parameter corresponds to the index of the new axis (the one the values are being stacked on) in the dimensions of the returned array.

How I would use np.concatenate and np.stack differently, to reflect their different behaviors and create the expected final 2-D array with shape (m, n’), comes down to how I prepare my input arrays.

For example:

The code and outputs above depict my original 2-D feature array (X) from my initial response to @Alireza_Saei, as well as my concatenate (X_eng_feat) and stack (X_eng_feat_2) methods (your implementations are also included at the bottom). In terms of array shape and dimensionality, the input arrays for concatenation should be 2-D arrays with identical shape (in this case). For stack, they should be 1-D arrays with identical shape in order to obtain the expected 2-D array with shape (m, n’).

In terms of correct transformed values and orientation for the concatenate method, I performed square and cubic transformations on the initial 1-D arrays of which my original 2-D feature array (X) is comprised. For each feature, I then stacked these 1-D arrays on a new, horizontal axis (axis=1) to create 2-D arrays with shape (5, 3), where axis=0 was the number of examples and axis=1 included the initial and transformed values for that feature. I then concatenated these three 2-D arrays along the existing horizontal axis to create the expected final 2-D array with shape (m, n’). For the stack method, I took all of these initial and transformed 1-D feature arrays and stacked them all together on a new horizontal axis (axis=1), in the orientation I think is ideal, to also create the expected final feature array.

Our implementations of concatenate and stack are slightly different and result in slightly different final arrays. For example, your implementation of concatenation works exclusively with 2-D arrays. I tried this implementation, and while it creates a 2-D array with the correct shape, dimensionality, and transformed feature values, these features are not in the orientation I think is ideal for this model. Specifically, your concatenation implementation orders the features as x1, x2, x3, x1_square, x2_square, x3_square, x1_cube, x2_cube, x3_cube. My opinion is that a feature orientation like x1, x1_square, x1_cube, x2, x2_square, x2_cube, x3, x3_square, x3_cube would be more ideal, as it emulates how the feature variables would be oriented in the model function. In terms of your output, my feature orientation would be 0, 10, 1, 11, 2, 12 (for that example).

I’m not sure how much this matters or if it changes anything, but it seems to correspond more to how the model is constructed. As you mentioned, I understand that my algorithm can’t read my python code to find out that certain groups of features have a power law relation; I just prefer this ordering because it has a more similar organization to the model. I’m sure there is a clever way to implement concatenation which achieves my preferred feature orientation and works exclusively with 2-D arrays, where you don’t stack before you concatenate like I did. However, this implementation, although more rugged and less pretty, worked for me, and I used all the same feature variables in the stack implementation as well.

Your implementation for stack is similar, but instead of using array slicing to slice 1-D feature arrays out of 2-D feature arrays and stacking them together horizontally, I just used all the 1-D feature array variables I had already defined. Our stacked arrays also have the same difference in feature orientation. This implementation of stack is pretty much identical to the one I initially posted, except I have taken some of your insight and used square brackets to enclose the input arrays in my concatenate and stack methods. I had been enclosing them in parentheses, and it seems to yield the same outputs, so I’m not sure if this merits much consideration.

As an enthusiastic learner who values the learning process, I would have been more than willing to follow your verification process of looking at each number and seeing if they are organized in the way you expect, i.e., the correct structure or (m, n’). However, the “correct” structure was not clear.

@Alireza_Saei initial response outlined that I “don’t need a 3-D matrix for your features. Instead you should expand your feature matrix by generating polynomial terms for each feature and then concatenate them into a single 2-D matrix. Next step, stack them horizontally to create a new feature matrix with shape (m, n’) , where n’ includes the original and polynomial features.”

Given your insight and implementation, everything I outlined above, and the numpy documentation for these methods: would stacking 2-D feature arrays together return a 3-D feature array? If yes, I hope you can understand that trying to verify a correct structure (m, n’) from instructions which do not yield this structure can be slightly misleading, and as a result prompted my response. Although this response was coupled with a loose grasp of the fundamental functionality of some of these tools, I can definitely say this understanding has been improved by this discourse. My confusion could also be a product of semantics, and my understanding of his instructions, or his illustration to me, could have been lost in translation, as I do not doubt @Alireza_Saei’s expertise!

Lastly, and most importantly, yes, my model would have 9 trainable weights and 1 bias. I say would because I have not started training yet. I can make sure I have the correct number of weights by confirming my weight array “w” has the correct shape (n’,). I can check the number of weights by confirming that the line “w = w - alpha * dj_dw” in the gradient descent function computes an element-by-element multiplication between alpha and the gradient, not a dot product, to ensure that the shape of the weight array remains constant throughout training, perhaps? I could also ensure that the weight gradient dj_dw has the shape (n’,) and remains constant throughout training. Maybe print out the weights throughout training?
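A small sketch of the checks I have in mind (toy numbers, with the gradient computation inlined rather than the lab’s compute_gradient function):

```python
import numpy as np

m, n_prime = 5, 9
rng = np.random.default_rng(0)
X = rng.normal(size=(m, n_prime))   # toy (m, n') feature matrix
y = rng.normal(size=m)
w, b = np.zeros(n_prime), 0.0       # one trainable weight per column, plus one bias
alpha = 0.01

for _ in range(100):
    err = X @ w + b - y             # (m,) prediction errors
    dj_dw = X.T @ err / m           # (n',): one gradient entry per weight
    dj_db = err.mean()
    w = w - alpha * dj_dw           # element-wise update keeps w's shape
    b = b - alpha * dj_db
    assert w.shape == (n_prime,) and dj_dw.shape == (n_prime,)

print(w.shape)  # (9,): still one weight per feature after training
```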

I also say would because it is not clear whether this process would be ideal. Are you saying that if I can ensure my model has one trainable weight for each corresponding feature, I can train this model with gradient descent with all the initial and polynomially transformed features included? @TMosh mentioned that a neural network can potentially address a lot of these questions, as it can automatically create non-linear combinations of the input features so that the model gives the lowest cost fit. However, I am not there yet and would like to conduct this analysis as I’ve described throughout my posts, if possible.

Thanks again Raymond for your insight. I hope we can continue this discourse.

Best Regards,
Matthew


Hello, Matthew @mwillson15, I appreciate your time in responding. Somehow it reminds me of my past self. I also used to write out my thought process. :blush:

I understand it. :wink: We are, sometimes, bound by what we know. I believe you were just trying to explain yourself so that Alireza would understand you better. No problem with that. :wink:

Since we may discuss more, I think it may also be good for you to know what I was thinking when I wrote my reply - I hope we leave the past behind us and focus on the improvement. In other words, if you can take care of the “leave behind us” part, I can focus on the “improvement” part. Is that a good idea?

For the rest of your response, I am glad to see that you know the difference between concatenate and stack, and if you don’t mind reading my version of the explanation:

  1. when I want to combine many feature arrays into one, I think about what is the axis that will grow.

  2. Obviously, axis = 1 will grow, because the sample size won’t change but the number of features increases. Features are along axis = 1, so it grows.

  3. Now,

    • concatenate acts on existing axis, and since that axis is axis = 1, the input arrays should all have such axis and thus be 2D. Existing axis, right?

    • stack results in a new axis, and since that axis is axis = 1, the input arrays should all NOT have such axis and thus be 1D. If the input arrays don’t have axis = 1, then the output array will have it because stack creates that. If the input arrays had axis = 1, since stack creates a new axis, the output array would have axis = 2 and thus become 3D. We don’t want 3D. :wink:
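To see both behaviors side by side (same toy A and B as in my code above):

```python
import numpy as np

A = np.arange(6).reshape(2, 3)
B = np.arange(6).reshape(2, 3) + 10

print(np.concatenate([A, B], axis=1).shape)  # (2, 6): the existing axis 1 grows
print(np.stack([A, B], axis=1).shape)        # (2, 2, 3): a new axis appears -> 3-D, which we don't want
```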

It won’t matter. You can have your preference on how to order the polynomial features, but the ordering should not change the performance of the model or the weight values associated with each feature.

Of course you can! I would do it exactly in this way!

Of course, do it in whatever order you like to!

Cheers,
Raymond

Thanks Raymond @rmwkwok for the quick response! I definitely agree that my initial lack of understanding of these methods was preventing me from simply inferring what Alireza was trying to explain. This would have allowed me to be more confident in verifying the correct structure myself.

I was under the impression that his instructions would yield that shape, and I was conceptualizing a structure where the initial and transformed values for each feature were defined in their own arrays along the horizontal axis. I now understand this structure will most likely always be 3-D.

Lastly, because this model would have 9 features, if I wanted to implement some visualizations, I should plot these prediction regressions against feature/target distributions for each feature separately, correct? I know this is opening up a whole other can of worms, and I understand that trying to visualize relationships between a target variable and more than 2 feature variables becomes increasingly complex, impractical, and even impossible after about 5 different features. But if I want to visualize these relationships separately, is it okay to use the weights optimized from the model which included all of the features, or is it better practice to retrain each model separately and use the optimized weights obtained from each of those corresponding “single” regression models? I’ve also read a little on partial residual plots, but accounting for correlations between features seems to be a common issue in all types of visualizations.

Thanks,
Matthew

I don’t think training using only one feature is going to be useful or informative.


Hello, Matthew,

The visualization approach you proposed should take 15-30 minutes to implement with good coding experience, or it could be a good exercise to gain that experience. Either way, I suggest you give it a try and see for yourself. As a mentor for learners, unless there is a very good reason not to, I encourage learning activities. :wink:

We can discuss your findings, we can find out value from your findings, and we might attempt to propose how to grow something from your findings. However, it all begins with your findings, not just a proposal :wink:

We take whatever useful from our failures, and we build up whatever we can on our successes.

Cheers,
Raymond

PS1: If you are used to giving your ideas a try, your coding skills will become very good. With good skills, rather than typing out a question and waiting for someone, that time is probably enough for you to code things up (not to bury ideas), see the results, and decide what to do next (not to bury results, but, for example, to invite opinions). Progress!

PS2: Your skill is yours, and is the one thing you must be able to take away :wink:

Yes, I totally agree, @rmwkwok! I was going to preface that last response by saying it would be my last question, as I would like to just start training and see where it takes me. I agree that your mistakes can be your most valuable teacher, which is why I would like to do a few projects before moving on with this specialization.

That last question was more about me not being sure whether there would be clear visual/statistical differences between using the weights from the full model versus individual models, or whether there are more ideal practices for these situations; I just wanted context on what to expect, but I will see for myself! Also, no time has been wasted in between this discourse, as I have pretty much set up the whole code structure for the analysis from scratch to get some more practice recalling these methods! Anyways, thanks again for your insight. It is always appreciated.

Best Regards,
Matthew


No problem! I don’t know what your proposed visualization will lead us to. Maybe a dead end. Maybe something completely out of our initial thoughts. But that’s the fun part. Onwards!

Cheers,
Raymond
