Using a correlation matrix for feature selection

hi everyone,

I have watched and read a few tutorials on exploratory data analysis (EDA), and they recommend exploring patterns in the data with pandas in order to select better features for the models.

I wonder whether a correlation matrix would be enough for EDA, especially for linear regression and logistic regression problems. A relatively high negative or positive correlation would imply a strong relationship between the label and the feature, so why not just use those features as a start and then maybe try polynomial features?

P.S. I am working on the famous Kaggle toy datasets these days. Would it be appropriate to post my notebook here and ask for feedback?

1 Like

Hi Mehmet! I hope that you are doing well.
Generally, we use a correlation matrix to study the correlation between independent variables (among features) and between the independent and dependent variables (i.e. between the features and our target variable). We do this for the following reasons:

  • Correlation between the features and the target variable: we check which features have a high correlation with the output variable. These are likely to be effective features that play a main part in predicting the output.
  • Correlation among the features: this is done in order to drop highly correlated features. Why drop them? Say, for example, X1 and X2 are highly correlated; then they will have the same kind of effect on the output variable, and the change in performance between using both X1 and X2 and using only one of them tends to be negligible. So we can at least reduce the computation cost by dropping one of them (see the sketch right after this list).
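A minimal pandas sketch of both checks on made-up data (the column names and the 0.9 threshold are just placeholders for illustration):

```python
import numpy as np
import pandas as pd

# Made-up data: x1 drives the target, x2 is an almost duplicate of x1, x3 is noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(0, 0.05, 500),   # highly correlated with x1
    "x3": rng.normal(size=500),            # irrelevant
    "target": 3 * x1 + rng.normal(0, 0.5, 500),
})

corr = df.corr()

# 1) Correlation of each feature with the target
print(corr["target"].drop("target").sort_values(key=abs, ascending=False))

# 2) Highly correlated feature pairs -> candidates for dropping one of the two
features = corr.drop(index="target", columns="target")
for i, a in enumerate(features.columns):
    for b in features.columns[i + 1:]:
        if abs(features.loc[a, b]) > 0.9:
            print(f"{a} and {b} are highly correlated ({features.loc[a, b]:.2f})")
```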

So as you can see, we decide based on both mathematical ideas and some intuition. We can't simply say that a correlation study alone concludes our EDA; for some cases it might be enough and for others it might not. It entirely depends on the dataset. So, as you said, we can start at one point, evaluate our performance, and then look at other methods to improve the quality of our data. It's clearly up to you how to proceed, but these (a correlation study, etc.) are common things we always tend to do so that we have a good start.
Well, this is just my opinion; do share your thoughts too if you have something different in mind.

Regards,
Nithin

3 Likes

Hi @mehmet_baki_deniz

In addition to @Nithin_Skantha_M's great feedback:

Here you can also find an example for inspiration, with context and code, in this repo (Source).

There is also a relevant thread on this for you. Correlation and scatter matrices help a lot, as you mentioned, e.g. for linear regression. For example, you could apply feature engineering techniques to describe the non-linearity already in your features, and then check and adapt your feature design based on your visualization.
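For instance (just a sketch on made-up data, not the code from the repo), a scatter matrix can reveal a non-linear pattern, which you then capture with an engineered feature:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up data with one linear and one quadratic relationship to the target
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.uniform(-2, 2, 300), "x2": rng.uniform(-2, 2, 300)})
df["target"] = df["x1"] + df["x2"] ** 2 + rng.normal(0, 0.2, 300)

# The scatter matrix shows the curved x2 pattern that a single Pearson number can miss
sns.pairplot(df)
plt.show()

# After spotting the curvature, engineer the feature and re-check the correlation
df["x2_squared"] = df["x2"] ** 2
print(df.corr()["target"].sort_values(key=abs, ascending=False))
```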

Hope that helps!

Best regards
Christian

3 Likes

Hi, thank you both for your responses.

For linear regression and logistic regression, what else can we observe that provides more insight for feature selection than the correlation matrix?

I am asking this because, from what I understand of most tutorials, visualization and grouping/sorting only seem to provide a manual analysis to speculate on the existence of a correlation between the independent and dependent variables. So I figured, why not just use the correlation matrix?

Then I thought I could use polynomial features to see if it helps to lower the cost.
But then @Christian_Simonis's response made me think that maybe we should also consider feature crosses beyond polynomial features. Then maybe visualization, as in Christian's repo, may help to figure out which ones to select?
But then… can't I also do it with the correlation matrix again, by randomly adding some feature crosses to the dataframe (roughly like the sketch below)…? :slight_smile:
After all, visualization also depends on our intuition about which features to visualize, right?
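Concretely, I imagine something like this (just a sketch on made-up data; the column names are placeholders):

```python
import numpy as np
import pandas as pd
from itertools import combinations

# Made-up data; in practice this would be the original feature dataframe
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(400, 3)), columns=["x1", "x2", "x3"])
df["target"] = df["x1"] * df["x2"] + rng.normal(0, 0.2, 400)

# Add all pairwise feature crosses to the dataframe
for a, b in combinations(["x1", "x2", "x3"], 2):
    df[f"{a}*{b}"] = df[a] * df[b]

# The cross x1*x2 now shows up clearly in the correlation with the target
print(df.corr()["target"].drop("target").sort_values(key=abs, ascending=False))
```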

Okay… part of this is about why I should bother too much with learning pandas and seaborn, etc. :slight_smile:
And the other part is genuine curiosity.

P.S. @Nithin_Skantha_M, your point on using the correlation matrix to analyze the relationships between the independent variables is great. I hadn't thought about it.
P.S. 2: @Christian_Simonis, I will check your repo in more depth. It seems very informative! Thank you for sharing.

Mehmet

2 Likes

Hi Mehmet,

My personal experience is that using only correlation coefficients can be a great start for sure, but often it's just not sufficient:


[figure omitted] (Source)

What I mean by that: correlation only captures the linear dependency between features, but you can also have highly structured patterns that are not indicated by the Pearson correlation coefficient alone (e.g. when they are highly non-linear):
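For a tiny illustration on made-up data, a strong but purely quadratic relationship shows an almost zero Pearson coefficient:

```python
import numpy as np
import pandas as pd

# Made-up data: one linear and one quadratic relationship to x
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 1000)
df = pd.DataFrame({
    "x": x,
    "y_linear": 2 * x + rng.normal(0, 0.5, 1000),
    "y_quadratic": x**2 + rng.normal(0, 0.5, 1000),  # highly structured, but non-linear
})

# y_quadratic is almost perfectly predictable from x, yet its Pearson correlation is ~0
print(df.corr()["x"])
```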

Hope that helps, @mehmet_baki_deniz!

Best regards
Christian

2 Likes

I would suggest that iteratively enhancing features (while checking the visualisations) makes sense, by systematically bringing more domain knowledge into your features. Feature crosses are one way, but you are not limited to them. Feel free to use any mathematical operations, or physical or other domain models, that just make sense to derive good features. My experience is that signal processing techniques (which you learn in mechatronics or system theory) are especially helpful: for example, a Fourier transform can come in handy when you work with oscillating systems and want to use your model for predictive analytics of a steady-state system.
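To give a flavour of what that could look like (a minimal sketch; the signal and sampling rate are made up), the dominant frequency and amplitude from a Fourier transform can serve as engineered features:

```python
import numpy as np

# Hypothetical oscillating sensor signal sampled at 100 Hz
fs = 100
t = np.arange(0, 10, 1 / fs)
signal = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.default_rng(0).normal(size=t.size)

# Fourier transform of the signal
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)

# Use the dominant frequency and its (normalized) amplitude as features
peak = spectrum[1:].argmax() + 1          # skip the DC component
dominant_freq = freqs[peak]               # ~5 Hz for this made-up signal
dominant_amp = spectrum[peak] / (signal.size / 2)
print(dominant_freq, dominant_amp)
```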

Besides that, in general a residual analysis is always helpful when evaluating your models or working on your features; see also: Anomaly Detection Algorithm Statistical Independence - #2 by Christian_Simonis
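As a minimal sketch of what that looks like in practice (made-up data; a missing quadratic term leaves a visible pattern in the residuals instead of white noise):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Made-up data with a hidden quadratic term
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, (300, 1))
y = 1.5 * X[:, 0] + X[:, 0] ** 2 + rng.normal(0, 0.2, 300)

# Fit a plain linear model and inspect the residuals
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# A systematic (here: parabolic) pattern hints at a missing feature or transformation
plt.scatter(X[:, 0], residuals, s=10)
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```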

Please let me know if anything is unclear.
All the best and happy learning, @mehmet_baki_deniz!

Best regards
Christian

2 Likes

These are very insightful analyses, Christian.

Here, in the picture, I understand that the decision boundary is circular (something like x1^2 + x2^2, right?). So visualization helps. In this visualization, you intentionally chose two features to visualize the non-linearity, but what exactly led you to choose those features out of a pool of m features? Moreover, if the correct model were something like x1^4 + x1^2 * x3^4, would we be able to visualize it? Wouldn't we be forced to find a non-visual method for feature engineering?

Wouldn't a better approach be to add polynomials of the features (that already seem important) to the dataframe and see if the new (linearized) features correlate with the target (roughly like the sketch below)? I just presume degree 3 would be enough to avoid overfitting.
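What I have in mind is something like this (just a sketch on made-up data; the columns are placeholders):

```python
import numpy as np
import pandas as pd

# Made-up data; "x1".."x3" and "target" stand in for the real columns
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(-2, 2, (400, 3)), columns=["x1", "x2", "x3"])
df["target"] = df["x2"] ** 2 + rng.normal(0, 0.2, 400)

# Add powers (degree 2 and 3) of each feature that already seems important
for col in ["x1", "x2", "x3"]:
    for degree in (2, 3):
        df[f"{col}^{degree}"] = df[col] ** degree

# x2^2 now correlates with the target even though x2 itself does not
print(df.corr()["target"].drop("target").sort_values(key=abs, ascending=False))
```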

Maybe that also wouldn't solve the problem, as it doesn't address selecting possible feature multiplication options…

As for signal processing techniques for ML applications, I have no idea what they are, but I will google them.

P.S. I will read through the thread tomorrow with a fresh mind to gain more insight into the problem.

You can describe the data with a sin(t), cos(t) relation; see:

After transforming the data, they become linearly separable.

Another way: you can also solve a problem like this with polynomial approaches; see: The Kernel Trick in Support Vector Classification | by Drew Wilimitis | Towards Data Science

When analysing the residuals, you do not want to see any systematic pattern. If the model did a bad job, you would see at least some pattern in the residual data rather than random “white noise”.

This can absolutely help in data understanding.
Why do you think it would be “better” to add new features?

I think the transformation, without growing the dimensional space, already makes the data linearly separable in a minimal dimensional space. To me it seemed quite elegant this way. But many approaches solve the issue.
In general: the most suitable approach depends on the data as well as the business problem you are solving. Often in reality it is sufficient to find a solution which is just “good enough”.

Best regards
Christian

1 Like

I proposed this approach to explore the existence of a possible non-linear relationship: maybe x doesn't correlate with the target value, but maybe x^2 does. But then I thought, how can we possibly explore a non-linearity such as x1*x2 with my method? I don't know.
For really big feature sets, I just suspect visualization wouldn't tell us much about selecting the features. So, I speculate, a more generalized mathematical formula for feature selection is needed.

I also didn't get what you mean by transformation, but I asked about it in the thread that you referred to, since you already have an explanatory post in it. I mean, I get the reason: it saves you from non-linearity through feature engineering. But what is the method for transforming?

P.S. Maybe another, safer method could be lasso regularization to butcher all the irrelevant features out of the model (something like the sketch below)?
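Something like this, I mean (a minimal sketch on made-up data; coefficients that are shrunk exactly to zero mark the features lasso considers irrelevant):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Made-up data: only the first two of five features actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 500)

# Standardize first so the L1 penalty treats all features comparably
X_scaled = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1).fit(X_scaled, y)

# Coefficients driven to zero correspond to features the model drops
print(lasso.coef_)
```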

Polar coordinate transformation with code example here: Can we start with the circle equation as decision boundary? - #12 by Christian_Simonis

Please have a look! Basically, it exploits the information in the Euclidean (radial) distance, which has good predictive ability when combined with a linear classifier for this specific problem.
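Not the exact code from that post, but the idea in a minimal sketch on made-up circular data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: class 1 inside a circle of radius 1, class 0 outside
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, (500, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

# Engineered feature: Euclidean (radial) distance from the origin
r = np.sqrt(X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)

# In the radial feature the classes are linearly separable,
# so a plain linear classifier is enough
clf = LogisticRegression().fit(r, y)
print(clf.score(r, y))
```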

True! You can for example calculate feature importance as described with a code example in the repo on CRISP-DM.
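The repo shows its own example; just to illustrate the idea (a minimal sketch on made-up data), permutation importance in scikit-learn ranks features by how much the score drops when each one is shuffled:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Made-up data: only two of four features are informative
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = 2 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.3, 400)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Importance = drop in score when a feature column is randomly permuted
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```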

I would suggest reading carefully through the provided sources, which should support you along your learning journey, @mehmet_baki_deniz!

Hope they help.

Best regards
Christian

1 Like

Hello @mehmet_baki_deniz, please feel free to.

Raymond

1 Like