Clarification on C2_W2_Lab_3_Feature_Selection - High correlation features

For the heat map generated under Correlation Matrix, I noticed that high correlations are also reported for the following feature pairs:
perimeter-se and radius-se => 0.97
perimeter-se and area-se => 0.94
radius-se and area-se => 0.95
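For anyone who wants to list such pairs without reading them off the heat map, here is a minimal sketch. It assumes the features sit in a pandas DataFrame; the helper name `high_corr_pairs`, the 0.9 threshold, the underscore column names and the synthetic demo data are all illustrative assumptions, not taken from the notebook.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.9):
    """Return (feature_a, feature_b, |corr|) for each pair at or above threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is reported once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (masked cells) never pass this comparison
    pairs = pairs[pairs >= threshold].sort_values(ascending=False)
    return [(a, b, round(float(v), 2)) for (a, b), v in pairs.items()]


# Synthetic stand-in data (NOT the lab's dataset), only to show the call
rng = np.random.default_rng(0)
radius_se = rng.uniform(0.1, 1.0, size=200)
demo = pd.DataFrame({
    "radius_se": radius_se,
    "perimeter_se": 2 * np.pi * radius_se + rng.normal(0, 0.05, size=200),
    "texture_se": rng.uniform(0.5, 2.0, size=200),
})
found = high_corr_pairs(demo, threshold=0.9)
print(found)
```

On the lab's DataFrame, the same call should surface the xxx-se pairs listed above alongside the ones the notebook already handles, assuming the threshold is set at or below their reported values.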

However, apart from the perimeter-mean / perimeter-worst series, I didn’t see these features being treated as removable when building subset_features.

If I follow the original example flow (without removing the highly correlated features of the xxx-se series):

  • The subset features category uses 21 features in total
  • The F-test category uses 20 features in total

But if the xxx-se features are also treated as removable, the final result will be:

  • The subset features category, with perimeter-se and radius-se also excluded, uses 19 features in total

This would change the conclusion that the F-test category is the most optimal of the 3. That conclusion rests on F-test using the fewest features, since all 3 demonstrate the same level of Accuracy, ROC, Precision, Recall and F1 Score.

So I’m wondering why perimeter-se and radius-se were not removed in the subset features demo in the first place. Is there anything I’ve misinterpreted in this regard? Kindly advise. Thanks.

I’ve taken a look at the data and I presume xxx_SE to be the standard error. It makes experimental sense that there would be a high correlation here, as the measurements would presumably have been taken under the same conditions, or, if they are all derived from, say, the radius, then the error would carry over.
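To illustrate the "error would carry" idea, here is a small simulation sketch. Everything in it is an assumption made for illustration: the measured objects are treated as circles, perimeter and area are computed directly from the same simulated radii, and the sample sizes and ranges are made up. It is not a description of how the real dataset was produced.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_nuclei = 300, 30

# Hypothetical setup: each sample has its own typical radius and spread
mean_r = rng.uniform(5.0, 10.0, size=n_samples)
spread = rng.uniform(0.5, 2.0, size=n_samples)
radius = rng.normal(mean_r[:, None], spread[:, None], size=(n_samples, n_nuclei))

# Perimeter and area derived from the same radii (circle assumption)
perimeter = 2 * np.pi * radius
area = np.pi * radius ** 2


def standard_error(x):
    """Standard error of the mean along each row."""
    return x.std(axis=1, ddof=1) / np.sqrt(x.shape[1])


# Rows: radius SE, perimeter SE, area SE, one value per sample
se = np.vstack([standard_error(radius), standard_error(perimeter), standard_error(area)])
corr = np.corrcoef(se)
print(np.round(corr, 2))
```

Under these assumptions the perimeter SE is just the radius SE scaled by 2π, so the two are perfectly correlated, and the area SE tracks them closely but not exactly because area is nonlinear in the radius. Real measurements would be noisier, which would be consistent with the high but not perfect values seen in the heat map.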

However, this still leaves your valid question unanswered. I agree that two of these three ‘xxx_SE’ features should be removed. @chris.favila, were these features accidentally excluded from the exclusion list, or is there something we are missing?

Hi! Good observation! I think this boils down to domain knowledge. There might be cases where you want to retain certain features because they can help detect rare conditions in the dataset. That will require some consultation with field experts. Removing all seemingly correlated features may run the risk of poor model accuracy in some data slices (*slightly related reading here regarding feature selection with a trained eye). That being said, I don’t know enough about this field to justify why those were retained by the subject matter expert.

However, if the goal is simply to have the smallest set of features, I agree with you, and Chris above makes a good point about why these can be removed. We’ll modify the notebook to leave it as a challenge for other learners to spot other correlated features whose removal results in the same performance. Thank you for pointing this out and I hope this helps!
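For learners taking up that challenge, one common way to prune correlated features automatically can be sketched as follows. This is not the notebook's method; the helper name `drop_correlated`, the 0.9 threshold and the synthetic data are assumptions for illustration. Note that which member of a correlated group survives depends only on column order here, which is exactly the kind of choice a domain expert might want to override.

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold=0.9):
    """Drop every column that correlates at or above threshold with an earlier column."""
    corr = df.corr().abs()
    # Strict upper triangle: each column is compared only with columns before it
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Synthetic stand-in data (NOT the lab's dataset)
rng = np.random.default_rng(1)
r = rng.uniform(0.1, 1.0, size=300)
demo = pd.DataFrame({
    "radius_se": r,
    "perimeter_se": 2 * np.pi * r + rng.normal(0, 0.05, size=300),
    "area_se": np.pi * r ** 2 + rng.normal(0, 0.02, size=300),
    "texture_se": rng.uniform(0.5, 2.0, size=300),
})
reduced, dropped = drop_correlated(demo, threshold=0.9)
print(dropped)
print(list(reduced.columns))
```

After pruning, it is worth retraining and comparing the same metrics as the notebook does, since a smaller feature set only counts as a win if performance holds up.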

Thanks @Chris @chris.favila for your time in attending to this topic.

There are times when we are given a dataset to work on without a field expert to consult, or are simply told that the data was queried raw from the database with no clear documentation. So understanding what to include / exclude, while still being able to explain why, is crucial when dealing with clients/stakeholders. The explanation given above clears up my doubts about whether I had misinterpreted or missed some points. It seems that feature selection is sometimes more an art than a science, with experiments to be carried out until the model metrics perform well, if computation time/resources are not an issue.

Nonetheless, I think the feature filtering methods demonstrated in the notebook are still practically useful to all learners here. Thanks a lot!

Excellent points! It’s definitely important to justify your decisions, especially if you don’t have field experts to consult. And I agree as well that there is an art to this process. Glad you found the notebook useful! Thanks again!