[Seeking Guidance] Heart Disease Classification

I’m getting these metrics after training
a Logistic Regression model,
a Random Forest model (n_estimators=200, max_features="sqrt", bootstrap=True, oob_score=True), and
a NN (3 hidden layers with 64, 32, and 16 units, ReLU; 1 output layer with sigmoid).

Shouldn’t a Random Forest or a NN usually have better accuracy?

Link to the Jupyter Notebook: https://rb.gy/qwu8c6


Hello Roy,

After reviewing your notebook, I came across this link about the error you are getting. It mentions that the error is related to bootstrap being True.

Also, regarding max_features: you used sqrt, so try the random subfeature approach (use the link to understand it better).
RF creates S trees and uses m (= sqrt(M) or = floor(ln M + 1)) random subfeatures out of M possible features to create each tree. This is called the random subspace method.
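
To make the two choices of m concrete, here is a minimal sketch that computes both for a given number of features M. The function name `subspace_sizes` and the value M = 13 are assumptions for illustration only; check the actual feature count in your notebook. Note also that scikit-learn's RandomForestClassifier draws the max_features subset at every split rather than once per tree.

```python
import math

def subspace_sizes(M):
    """Return the two common subspace sizes m for M available features."""
    return {
        "sqrt": max(1, math.isqrt(M)),             # m = floor(sqrt(M)), at least 1
        "log": int(math.floor(math.log(M) + 1)),   # m = floor(ln M + 1)
    }

# M = 13 is a hypothetical feature count, used only as an example.
print(subspace_sizes(13))
print(subspace_sizes(100))
```

For small M the two rules often give the same m, so switching between them may not change the forest much.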

Check the link below; it might help you.

Also, check whether you can use L2 regularization as a hyperparameter. (I am not saying that L1 is incorrect; this is just to see whether you get different predictive results.)

Regards
DP


Hello @Debatreyo_Roy,

Random Forest → RF
Logistic Regression → LogR
Neural Network → NN

It is extremely dangerous to put an equal sign between a method and a performance expectation. We can easily build and configure a NN that is doomed to overfit the data and perform very badly.

Your notebook is a good starting point to see some results, but it is too early to expect that “a complex model beats a simple model”.

  1. Did you inspect the other parameters of RF that were not tuned in your GridSearchCV? Are their default values good enough to beat your best LogR?

  2. In the training logs of your NN, the test set loss drops at first and then climbs later. What does this signal? Is training longer always better? Did you get your best NN? If not, why should we be surprised that a bad NN does not perform better than the best LogR?
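
As a hint for question 2, a loss that falls and then rises on held-out data is the usual sign of overfitting, and one common response is early stopping: keep the epoch with the lowest held-out loss and stop once it has not improved for a few epochs. The sketch below shows the idea in plain Python. The function name `best_epoch`, the `patience` value, and the loss numbers are all made up for illustration and are not taken from the notebook.

```python
def best_epoch(val_losses, patience=3):
    """Return (epoch_index, loss) of the best epoch seen before stopping.

    Stops scanning once the loss has failed to improve for `patience`
    consecutive epochs.
    """
    best_idx, best_loss, waited = 0, float("inf"), 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss, waited = i, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_idx, best_loss

# Hypothetical held-out losses: they drop at first, then climb.
losses = [0.69, 0.55, 0.48, 0.45, 0.46, 0.49, 0.53, 0.40]
print(best_epoch(losses))
```

If the NN is built with Keras (an assumption about the notebook), the `tf.keras.callbacks.EarlyStopping` callback with `restore_best_weights=True` applies the same idea during training. It is also better to monitor a separate validation split than the test set, so that the test set stays untouched for the final comparison.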

@Debatreyo_Roy, my overall impression is that you have kick-started the project, which is extremely important, but the notebook has not shown that it found the best RF or the best NN to compare with your best LogR. Therefore, we cannot blame the methods.

I recommend that you focus on RF and NN one at a time, study them thoroughly, and convince yourself that you have got the best RF and the best NN by exploring everything you can configure about them.

For RF, the full list of things you can configure is the list of input arguments for sklearn.ensemble.RandomForestClassifier. However, for NN, there is no such full list. Will you google it? Will you find out how others deal with both overfitting and underfitting? Show your research in the notebook ;).
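
For the RF side, a minimal sketch of how to list every configurable argument and widen the search is shown below. It uses synthetic data from `make_classification` as a stand-in for the heart disease features, and the grid values are illustrative guesses rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption); replace with your own X and y.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)

rf = RandomForestClassifier(n_estimators=50, random_state=0)

# get_params() returns every constructor argument with its current value.
all_params = rf.get_params()
print(sorted(all_params))

# Illustrative grid over arguments that control tree size and feature sampling.
param_grid = {
    "max_depth": [None, 4, 8],
    "min_samples_leaf": [1, 5],
    "max_features": ["sqrt", "log2"],
}

search = GridSearchCV(rf, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Arguments such as `max_depth` and `min_samples_leaf` limit how far each tree can grow, which is one of the main ways to control overfitting in a forest on a small dataset.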

Good luck!
Raymond

PS: I moved this to the AI Projects category.
