Rate my code - Heart failure project with Italian patients

Hello,

I am a cardiologist and currently working on a heart failure project.
There are 6 centers in Italy, and I am managing one of them. Each center will enroll around 150-300 patients, so we should end up with around 1000 patients.

Right now I am working only with the data of my center (about 240 patients).

We have a few patient characteristics like sex, height, weight, hypertension, cancer, etc., and we already know if they have heart failure.

This is my code, how do you think I can improve it?

As you can see, I have already put a lot of question marks where I am not sure about the best way to do some steps.

# scale (standardize) the features
scaler = StandardScaler()
x_all_scaled = scaler.fit_transform(x_all)

x_train, x_cv, y_train, y_cv = train_test_split(x_all_scaled, y_all, test_size=0.20, random_state=RANDOM_SEED)

# hypertuning
# don't add too much stuff
params = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.001, 0.01, 0.1],
    "gamma":[0.5, 1, 1.5],
    'n_estimators': [100, 500],
    #"subsample":[0.6, 0.8, 1.0],
    "colsample_bytree":[0.6, 0.8, 1.0],
    "early_stopping_rounds": [10],
    "random_state": [RANDOM_SEED],
    #'min_child_weight': [1, 5, 10],
    }

cv_folds = 3


# hypertuning
xgb_hyper =  XGBClassifier() # add objective='binary:logistic'????

skf = StratifiedKFold(n_splits=cv_folds, random_state=RANDOM_SEED, shuffle=True)

# Use GridSearchCV for all combinations
grid = GridSearchCV(
    estimator = xgb_hyper,
    param_grid = params,
    scoring = 'roc_auc', # or 'precision' or 'accuracy'?
    n_jobs = -1,
    cv = skf,
    verbose = 1,
)

# Model fitting
grid = grid.fit(x_train, y_train, eval_set=[(x_cv, y_cv)])

print(f"Best params: {grid.best_params_}")
print(f"Best score: {grid.best_score_}")

y_hat_train = grid.best_estimator_.predict(x_train)
y_hat_cv = grid.best_estimator_.predict(x_cv)

# why best model and auc not the same????
# why precision and recall is 0????

print(f"AuC: {roc_auc_score(y_cv, grid.best_estimator_.predict_proba(x_cv)[:,1])}")
print(f"Train - Accuracy: {accuracy_score(y_train,y_hat_train):.4f}\tPrecision: {precision_score(y_train,y_hat_train):.4f}\tRecall: {recall_score(y_train,y_hat_train):.4f}")
print(f"CV    - Accuracy: {accuracy_score(y_cv,y_hat_cv):.4f}\tPrecision: {precision_score(y_cv,y_hat_cv):.4f}\tRecall: {recall_score(y_cv,y_hat_cv):.4f}")


What is your project’s goal?

Improve the code: right now I have an AUC of around 0.7, and I would like to get above 0.8.

Do you see anything wrong, or anything that could be done better?

If you meant it more generally: it's about detecting heart failure :laughing:

That's what I was asking about: what is the intended output of the model?

Sorry, I thought it was clear from the code: it's a classification task.
The model should output 1 if heart failure, 0 otherwise.
The correct label is in y; x_train and x_cv contain the patient characteristics. I think it's always this way? :sweat_smile:

There’s no need to scale features for XGBoost classifier.

The proper way to scale data for classification problems:

  1. train_test_split should make use of the stratify parameter.
  2. Re-scaling parameters should be learnt only on train data: fit_transform the training data and transform the other data splits. train_test_split should be done before re-scaling the data (see the sketch below).
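
A minimal sketch of points 1 and 2, reusing your x_all, y_all and RANDOM_SEED (adapt as needed):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Split first, stratified on the label so both splits keep the same positive rate
x_train, x_cv, y_train, y_cv = train_test_split(
    x_all, y_all, test_size=0.20, random_state=RANDOM_SEED, stratify=y_all)

# 2. Learn the scaling parameters on the training split only, then reuse them
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_cv_scaled = scaler.transform(x_cv)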

Questions based on the shared code:

  1. What’s the label distribution?
  2. Can you confirm that the test distribution is similar to that of train distribution?

Hello,

Good point about the distribution, here it is:
all:   240 patients - 84 positive - 35.0%
train: 144 patients - 53 positive - 36.8%
CV:     48 patients - 15 positive - 31.3%
test:   48 patients - 16 positive - 33.3%

It seems good to me.

Note: since we are still adding patients (it's an ongoing project), these numbers may change.

About the scaling, there is something that is not clear to me:

There’s no need to scale features for XGBoost classifier

but later you talk about scaling. Maybe the second part of your post is about scaling in general, because I was not doing it properly, but in any case I don't have to do it for XGBoost? I also found this: Is Normalization necessary? · Issue #357 · dmlc/xgboost · GitHub, but I don't understand the mathematical reason why it's not needed.

Also, do you advise using 3 datasets: train, cv and test? At the beginning of the course Prof Ng uses 3 datasets, then only train and cv.

About the stratification: I have fewer than 300 patients, should I do it with such a small dataset?

In this case, the distribution of the input features is of interest.

Correct. It was information regarding scaling data properly. Don’t re-scale for xgboost classification.
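
A quick way to convince yourself (a toy demo, not your data; make_classification just stands in for the patient features): rescaling the inputs doesn't change what an XGBoost model learns, because tree splits only compare a feature against a threshold.

import numpy as np
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Toy data standing in for the patient features (illustration only)
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_rescaled = X * 1000.0  # change the scale of every feature

proba_raw = XGBClassifier(random_state=42).fit(X, y).predict_proba(X)
proba_rescaled = XGBClassifier(random_state=42).fit(X_rescaled, y).predict_proba(X_rescaled)

# A monotone rescaling moves the split thresholds but not the resulting partitions,
# so the predicted probabilities should match (up to floating-point noise)
assert np.allclose(proba_raw, proba_rescaled)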

It’s safe to stratify when splitting the dataset (irrespective of the size of the underlying dataset) so that the data splits have the same distribution of the target label.

The code shared is regarding XGBRegressor. reg:gamma is meant for a regression target that follows a gamma distribution. This parameter doesn't have to be considered for a classification task.

You should be okay with 2 splits (70/30 is what Andrew mentions in the Deep Learning Specialization lectures, where there is no cv split) for a dataset of this size. I've seen many people do 80/20 splits as well.

Does this (keeping decision trees in mind) help?

@TMosh Does MLS have any material around tree based approaches?


Nice article. Andrew doesn't talk about the residuals correction in the lessons; he should, it's very clever and easy to understand.

About the reason why decision trees don't need normalization: I actually didn't understand it well from your article (maybe more mathematically inclined people than me will), but this Stack Exchange question made it clearer: beginner - Do you have to normalize data when building decision trees using R? - Data Science Stack Exchange

About gamma, thanks, I was using it blindly. I checked it now and found almost the same words as yours: Configure XGBoost "reg:gamma" Objective | XGBoosting

The “reg:gamma” objective in XGBoost is used for regression tasks when the target variable follows a gamma distribution. This objective is suitable for non-negative target variables with skewed distributions

Definitely not needed for my dataset

Good suggestions overall; over the weekend I will follow them and check if I can do better.

Just a few more questions:

  1. XGBoost Parameters — xgboost 2.1.2 documentation - should I specify an objective in the constructor, like 'binary:logistic'?
  2. colsample_bytree - I understood from the lessons that this parameter is always used in random forests; maybe it's not the same for XGBoost. Do you advise using it with just 21 features?
  3. scoring - any advice? In the medical field I am used to the AUC for comparing models, so I thought of using it.

I will post in this topic an updated version of the code

Not required. XGBClassifier is smart enough to distinguish between binary and multi-class classification problems when you invoke .fit on it. The objective will be set to either binary:logistic or multi:softprob.
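
A quick check on toy datasets (load_breast_cancer and load_iris are just stand-ins here):

import numpy as np
from sklearn.datasets import load_breast_cancer, load_iris
from xgboost import XGBClassifier

# Binary problem: leaving the objective unset vs. setting it explicitly gives the same model
Xb, yb = load_breast_cancer(return_X_y=True)
p_default = XGBClassifier(random_state=42).fit(Xb, yb).predict_proba(Xb)
p_explicit = XGBClassifier(objective='binary:logistic', random_state=42).fit(Xb, yb).predict_proba(Xb)
assert np.allclose(p_default, p_explicit)

# Multi-class problem: same class, no objective passed, still trains fine
Xm, ym = load_iris(return_X_y=True)
proba = XGBClassifier(random_state=42).fit(Xm, ym).predict_proba(Xm)
print(proba.shape)  # (150, 3) -- one probability per class, i.e. a softprob-style output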

Use GridSearchCV for this. Try values in the range [0.5, 1].

eg:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
import numpy as np
X, y = load_iris(return_X_y=True)
grid = GridSearchCV(param_grid={'colsample_bytree': np.arange(5, 11) * .1}, 
                    estimator=XGBClassifier(random_state=42),
                    scoring='accuracy')
res = grid.fit(X, y)
print(res.best_params_) # {'colsample_bytree': 0.5}

I don’t know about the problem domain to confirm your choice. roc_auc looks reasonable though.


Maybe he wants to find out the factors leading to heart failure.


Hi, @aster94. The code you have is great. There are a few improvements you can make.

Avoid data leakage by fitting the scaler only on the training set and then applying it to both the training and validation/test sets (i.e. do the split first and transform after):

x_train, x_cv, y_train, y_cv = train_test_split(x_all, y_all, test_size=0.20, random_state=RANDOM_SEED)

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_cv_scaled = scaler.transform(x_cv)

Perform feature engineering after splitting:

  • If you calculate features like averages, correlations, or interactions using the entire dataset, it may leak information from the validation/test set into the training set.

Solution: Perform any feature engineering steps, such as creating BMI or aggregating statistics, after splitting and ensure they are computed only from the training data.
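
For example, an aggregate feature done the leak-free way (a toy sketch; the column names 'sex', 'weight' and 'hf' are made up, adapt them to your dataset):

import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny toy frame with hypothetical columns
df = pd.DataFrame({'sex':    [0, 1, 0, 1, 0, 1],
                   'weight': [60, 80, 70, 90, 65, 85],
                   'hf':     [0, 1, 0, 1, 1, 0]})
train_df, cv_df = train_test_split(df, test_size=0.33, random_state=0, stratify=df['hf'])

# Aggregate statistic (mean weight per sex) learnt from the TRAINING split only...
mean_weight_by_sex = train_df.groupby('sex')['weight'].mean()

# ...and applied to both splits, so nothing from the CV split leaks into training
train_df = train_df.assign(mean_weight_sex=train_df['sex'].map(mean_weight_by_sex))
cv_df = cv_df.assign(mean_weight_sex=cv_df['sex'].map(mean_weight_by_sex))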

Cross-Validation with Leakage: If preprocessing steps or hyperparameter tuning use the entire dataset (e.g., during cross-validation), leakage occurs.

For example:

  • Scaling inside a cross-validation loop without isolating the validation fold leads to leakage.

Solution: try using a pipeline to ensure preprocessing and model training occur separately for each fold during cross-validation:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('xgb', XGBClassifier(eval_metric='auc'))
])

# Note: when tuning a Pipeline, the keys in params need the step prefix,
# e.g. 'xgb__max_depth' instead of 'max_depth'
grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=params,
    scoring='roc_auc',
    cv=3
)

grid_search.fit(x_all, y_all)

Using Target Information: If any feature is derived from the target variable (y_all), it introduces leakage. For instance:

  • Imputing missing values in predictors based on the target variable.
  • Encoding categorical variables using target-dependent metrics (e.g., target mean encoding) before splitting.

Solution: Make sure the target variable is not used for any preprocessing step.
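
As a concrete (hypothetical) example for the imputation case: fill a missing age with a statistic learnt from the training rows only, never from y and never from the validation/test rows.

import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical 'age' column with a missing value in each split (illustration only)
age_train = np.array([[54.0], [61.0], [np.nan], [70.0]])
age_cv = np.array([[np.nan], [66.0]])

imputer = SimpleImputer(strategy='median')
age_train_filled = imputer.fit_transform(age_train)  # median computed on the training rows
age_cv_filled = imputer.transform(age_cv)            # the same median reused on the CV rows
print(imputer.statistics_)  # [61.]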

Finally, make sure of the following (basically: keep separate training, validation, and test datasets):
Dedicated Test Set:

  • Reserve a portion of the data exclusively as a test set that is never seen during training or hyperparameter tuning.
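
One possible way to carve that out (a sketch reusing your x_all, y_all and RANDOM_SEED; 60/20/20 is just an example ratio):

from sklearn.model_selection import train_test_split

# First cut: 60% train, 40% held out
x_train, x_tmp, y_train, y_tmp = train_test_split(
    x_all, y_all, test_size=0.40, random_state=RANDOM_SEED, stratify=y_all)

# Second cut: split the held-out 40% into CV and test (20% each of the original data);
# the test split is only touched once, at the very end
x_cv, x_test, y_cv, y_test = train_test_split(
    x_tmp, y_tmp, test_size=0.50, random_state=RANDOM_SEED, stratify=y_tmp)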

Normalization ensures fairness among features, faster convergence, and better model performance, particularly for algorithms sensitive to feature scales. Always consider the type of algorithm and data characteristics when deciding on normalization.

Handles Features with Different Scales

  • Raw data often contains features with varying ranges:
  • Example: Height (in cm) ranges from 150–200, while weight (in kg) ranges from 50–100.
  • Algorithms that use gradient-based optimization (e.g., logistic regression, neural networks) are sensitive to these differences because features with larger scales dominate the optimization process.

Impact:

  • Models may converge slowly or fail to find the optimal solution.
  • Some features may be ignored due to their smaller scale.

Normalization Solution:

  • Scales all features to a similar range, making optimization more efficient.

When to Normalize

  • Always normalize when using models sensitive to feature magnitudes (e.g., gradient-based methods, distance-based algorithms).
  • Avoid normalizing when:
    Using models like decision trees or random forests, which are scale-invariant.
    Feature scales carry meaningful information (e.g., financial data).
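
A tiny illustration of what StandardScaler does with features on different scales (toy height/weight numbers, illustration only):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Height in cm and weight in kg live on very different ranges
X = np.array([[150.0, 50.0],
              [175.0, 75.0],
              [200.0, 100.0]])
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # approximately [1, 1]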

If you have huge compute resources try a grid search with the below:
params = {
    'max_depth': [3, 5, 7, 9],
    'learning_rate': [0.001, 0.01, 0.05, 0.1],
    'gamma': [0, 0.5, 1],
    'n_estimators': [100, 500, 1000],
    'subsample': [0.7, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'min_child_weight': [1, 5, 10],
    'early_stopping_rounds': [10],  # can be part of fit() instead
    'random_state': [RANDOM_SEED],
}

For example, you may want to expand the parameter grid carefully, focusing on the most impactful hyperparameters for XGBoost (if you use that among others):

  • max_depth: Can explore wider values (e.g., 3 to 10).
  • learning_rate: Smaller steps like [0.001, 0.01, 0.05, 0.1] can provide better control.
  • gamma: Fine-tune values, focusing on [0, 0.5, 1, 1.5].
  • n_estimators: Test a broader range if resources allow (e.g., [50, 100, 200, 500, 1000]).
  • subsample and colsample_bytree: These help prevent overfitting and should remain in your grid.
  • min_child_weight: Including this is critical for controlling tree growth, e.g., [1, 5, 10].

Hello,

I updated the code with your suggestions, and now the model performance is better!

AuC: 0.7196
Train - Accuracy: 0.7940 Precision: 0.7857 Recall: 0.6027
CV - Accuracy: 0.6800 Precision: 0.5625 Recall: 0.5000

Here is the code, in case you have other suggestions:

x_train, x_cv, y_train, y_cv = train_test_split(x_all, y_all, test_size=0.20, random_state=RANDOM_SEED, stratify=y_all)

cv_folds = 3

params = {
    'max_depth': [2, 3, 4],
    'learning_rate': [0.1, 0.3, 0.5],
    'n_estimators': [20, 50, 100],
    "subsample":[0.6, 0.8, 1.0],
    "colsample_bytree":[0.6, 0.8, 1.0],
    'min_child_weight': [3, 5, 8],
    'early_stopping_rounds':[10]
    }

# hypertuning
xgb_hyper =  XGBClassifier(random_state=RANDOM_SEED)

# Use GridSearchCV for all combinations
grid = GridSearchCV(
    estimator = xgb_hyper,
    param_grid = params,
    scoring = 'roc_auc', # or 'precision' or 'accuracy'
    n_jobs = -1,
    cv = cv_folds,
    verbose = 1,
    refit = True
)

# Model fitting
grid = grid.fit(x_train, y_train, eval_set=[(x_cv, y_cv)])

print(f"Best params: {grid.best_params_}")
print(f"Best score: {grid.best_score_}")

y_hat_train = grid.best_estimator_.predict(x_train)
y_hat_cv = grid.best_estimator_.predict(x_cv)

# why best model and auc not the same????

print(f"AuC: {roc_auc_score(y_cv, grid.best_estimator_.predict_proba(x_cv)[:,1]):.4f}")
print(f"Train - Accuracy: {accuracy_score(y_train,y_hat_train):.4f}\tPrecision: {precision_score(y_train,y_hat_train):.4f}\tRecall: {recall_score(y_train,y_hat_train):.4f}")
print(f"CV    - Accuracy: {accuracy_score(y_cv,y_hat_cv):.4f}\tPrecision: {precision_score(y_cv,y_hat_cv):.4f}\tRecall: {recall_score(y_cv,y_hat_cv):.4f}")

Best params: {'colsample_bytree': 1.0, 'early_stopping_rounds': 10, 'learning_rate': 0.5, 'max_depth': 2, 'min_child_weight': 5, 'n_estimators': 20, 'subsample': 0.8}

I do get a difference between grid.best_score_ (0.8094) and roc_auc_score(y_cv, grid.best_estimator_.predict_proba(x_cv)[:,1]) (0.7196).
Not sure why; shouldn't they be the same?

  1. Once best_score_ is calculated, refit=True of GridSearchCV creates a new estimator with best_params_ and fits it on the entire dataset passed to fit. This is why best_score_ and roc_auc_score(y_cv, grid.best_estimator_.predict_proba(x_cv)[:,1]) are different.
  2. It would be nice to use 10-fold CV, or 5 if you're constrained by resources.

Questions:

  1. What was the previous value of AUC, before 0.7196?
  2. Why not explore a wider parameter grid? Try HalvingGridSearchCV to get an estimate on a huge grid if you are constrained by resources (see the sketch below).
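
For reference, a sketch of what that could look like (reusing your x_train, y_train and RANDOM_SEED; the grid values are just examples):

# HalvingGridSearchCV is still behind an experimental flag in scikit-learn,
# so the enable_* import has to come first
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV, StratifiedKFold
from xgboost import XGBClassifier

params = {
    'max_depth': [2, 3, 4, 5],
    'learning_rate': [0.001, 0.01, 0.05, 0.1],
    'n_estimators': [50, 100, 200, 500],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'min_child_weight': [1, 5, 10],
}

halving = HalvingGridSearchCV(
    estimator=XGBClassifier(random_state=RANDOM_SEED),
    param_grid=params,
    factor=3,  # each round keeps roughly the best 1/3 of the candidates
    scoring='roc_auc',
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED),
    random_state=RANDOM_SEED,
)
halving.fit(x_train, y_train)
print(halving.best_params_, halving.best_score_)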

Hello,

The AUC was around 0.6 before; sorry, I didn't write it down and I'm not sure about the decimals.

I tried commenting out only refit=True, but the AUCs are still different.

I tried a larger grid and 10-fold CV, and it's getting closer to 0.8!

cv_folds = 10

params = {
    'max_depth': [2, 3, 4, 5],
    'learning_rate': [0.001, 0.05, 0.1, 0.2],
    'n_estimators': [10, 20, 30],
    "subsample":[0.4, 0.6, 0.8, 1.0],
    "colsample_bytree":[0.6, 0.8, 1.0],
    'min_child_weight': [2, 3, 4, 5, 6],
    'early_stopping_rounds':[10]
    }

# hypertuning
xgb_hyper =  XGBClassifier(random_state=RANDOM_SEED)

# Use GridSearchCV for all combinations
grid = GridSearchCV(
    estimator = xgb_hyper,
    param_grid = params,
    scoring = 'roc_auc', # or 'precision' or 'accuracy'
    n_jobs = -1,
    cv = cv_folds,
    verbose = 1,
    #refit = True
)

# Model fitting
grid = grid.fit(x_train, y_train, eval_set=[(x_cv, y_cv)])

print(f"Best params: {grid.best_params_}")
print(f"Best score: {grid.best_score_}")

y_hat_train = grid.best_estimator_.predict(x_train)
y_hat_cv = grid.best_estimator_.predict(x_cv)

# why best model and auc not the same????

print(f"Train - Accuracy: {accuracy_score(y_train,y_hat_train):.4f}\tPrecision: {precision_score(y_train,y_hat_train):.4f}\
      \tRecall: {recall_score(y_train,y_hat_train):.4f}\tAuC: {roc_auc_score(y_train, grid.best_estimator_.predict_proba(x_train)[:,1]):.4f}")
print(f"CV    - Accuracy: {accuracy_score(y_cv,y_hat_cv):.4f}\tPrecision: {precision_score(y_cv,y_hat_cv):.4f}\
      \tRecall: {recall_score(y_cv,y_hat_cv):.4f}\tAuC: {roc_auc_score(y_cv, grid.best_estimator_.predict_proba(x_cv)[:,1]):.4f}")

Best params: {'colsample_bytree': 1.0, 'early_stopping_rounds': 10, 'learning_rate': 0.1, 'max_depth': 4, 'min_child_weight': 3, 'n_estimators': 20, 'subsample': 0.4}
Best score: 0.7847641941391941
Train - Accuracy: 0.7800 Precision: 0.7959 Recall: 0.5342 AuC: 0.8382
CV - Accuracy: 0.7600 Precision: 0.8000 Recall: 0.4444 AuC: 0.7778

best_score_ refers to the mean cross-validated score of the best_estimator_ (see this), hence the difference in scores.
Please leave refit=True, since it's the right approach: once the best params are found, you want to fit on the entire training dataset provided to GridSearchCV.
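
You can check this directly on the fitted grid object (a small sketch using your grid from above):

import numpy as np

# best_score_ is the mean of the per-fold validation scores for the winning
# parameter combination, taken from cv_results_ -- not a score on your x_cv split
best_idx = grid.best_index_
assert np.isclose(grid.best_score_, grid.cv_results_['mean_test_score'][best_idx])

# The individual fold scores behind that mean (10 values with cv=10)
fold_scores = [grid.cv_results_[f'split{i}_test_score'][best_idx]
               for i in range(grid.n_splits_)]
print(fold_scores)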


OK, now the difference in scores, and k-fold CV in general, is a bit clearer to me.

I just tested with refit=True, cv=10 and 263 patients, and I get these results:

Best score: 0.7667
Train - Accuracy: 0.7799 Precision: 0.8333 Recall: 0.4730 AuC: 0.8300
CV - Accuracy: 0.7358 Precision: 0.7273 Recall: 0.4211 AuC: 0.7508

Can't wait to have access to the full dataset and reach around 1000 patients.

Anyway, in case one of the other centers has missing data (for example they forgot to write down the age of a few patients), how should I treat these missing values?


Please see this.
