Hello,
I am a cardiologist currently working on a heart failure project.
There are 6 centers in Italy and I am managing one of them. Each center will enroll around 150-300 patients, so we should get around 1000 patients in total.
Right now I am working only with the data from my center (about 240 patients).
We have a few patient characteristics such as sex, height, weight, hypertension, cancer, etc., and we already know whether each patient has heart failure.
This is my code. How do you think I can improve it?
As you can see, I have already put a lot of question marks in the comments because I am not sure what the best way to do some of the steps is.
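from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score
from xgboost import XGBClassifier
# x_all, y_all and RANDOM_SEED are assumed to be defined earlier (not shown)
# standardize features (zero mean, unit variance)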
scaler = StandardScaler()
x_all_scaled = scaler.fit_transform(x_all)
x_train, x_cv, y_train, y_cv = train_test_split(x_all_scaled, y_all, test_size=0.20, random_state=RANDOM_SEED)
# hyperparameter tuning
# keep the grid small
params = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.001, 0.01, 0.1],
    'gamma': [0.5, 1, 1.5],
    'n_estimators': [100, 500],
    #'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'early_stopping_rounds': [10],
    'random_state': [RANDOM_SEED],
    #'min_child_weight': [1, 5, 10],
}
cv_folds = 3
# base model and cross-validation splitter
xgb_hyper = XGBClassifier() # add objective='binary:logistic'????
skf = StratifiedKFold(n_splits=cv_folds, random_state=RANDOM_SEED, shuffle=True)
# Use GridSearchCV for all combinations
grid = GridSearchCV(
    estimator=xgb_hyper,
    param_grid=params,
    scoring='roc_auc',  # or 'precision' or 'accuracy'?
    n_jobs=-1,
    cv=skf,
    verbose=1,
)
# Model fitting
grid = grid.fit(x_train, y_train, eval_set=[(x_cv, y_cv)])
print(f"Best params: {grid.best_params_}")
print(f"Best score: {grid.best_score_}")
y_hat_train = grid.best_estimator_.predict(x_train)
y_hat_cv = grid.best_estimator_.predict(x_cv)
# why are the best score and this AUC not the same????
# why are precision and recall 0????
print(f"AUC: {roc_auc_score(y_cv, grid.best_estimator_.predict_proba(x_cv)[:,1])}")
print(f"Train - Accuracy: {accuracy_score(y_train,y_hat_train):.4f}\tPrecision: {precision_score(y_train,y_hat_train):.4f}\tRecall: {recall_score(y_train,y_hat_train):.4f}")
print(f"CV - Accuracy: {accuracy_score(y_cv,y_hat_cv):.4f}\tPrecision: {precision_score(y_cv,y_hat_cv):.4f}\tRecall: {recall_score(y_cv,y_hat_cv):.4f}")
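
A minimal sketch of one possible reason for the zero precision and recall asked about in the comments, assuming the predicted probabilities all stay below 0.5 (which can happen with a small learning rate and an imbalanced outcome). AUC only depends on how the patients are ranked, while `predict` applies a 0.5 cutoff, so a model can rank perfectly and still never predict the positive class. The labels, probabilities, and helper names below are made up for illustration and are not taken from the real data.

```python
def auc(y_true, scores):
    """Rank-based AUC: chance that a random positive scores above a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # Count pairs where the positive is ranked higher; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall(y_true, scores, threshold):
    """Precision and recall after cutting the scores at `threshold`."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    # Report 0.0 when the denominator is empty (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels and predicted probabilities (illustration only).
# Positives are ranked highest, but every probability is below 0.5.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.20, 0.22, 0.21, 0.25, 0.23, 0.24, 0.30, 0.35]

auc_value = auc(y_true, scores)
p_default, r_default = precision_recall(y_true, scores, 0.5)   # default cutoff
p_low, r_low = precision_recall(y_true, scores, 0.28)          # lower cutoff

print(f"AUC: {auc_value:.2f}")
print(f"threshold 0.50 -> precision {p_default:.2f}, recall {r_default:.2f}")
print(f"threshold 0.28 -> precision {p_low:.2f}, recall {r_low:.2f}")
```

If this is what is happening, inspecting the output of `predict_proba` on `x_cv` should show it directly. Separately, `grid.best_score_` is the mean AUC over the cross-validation folds of `x_train`, while the printed AUC is computed on `x_cv`, so the two numbers are not expected to match exactly.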