Thank you @Deepti_Prasad, I really appreciate your suggestions! I did not say much about the features because I thought the meaning of each feature does not matter much. As long as a feature shows a higher correlation with the membership status, shouldn't we put it into the model?
I ran a correlation analysis and selected the 20 features with the highest correlations for modelling.
                    VOL_ACTUAL  INVOL_ACTUAL   CONTINUE
OTHER_PAY             0.150662           NaN  -0.118051
INVOL_POSSIBLE       -0.142355      0.016845   0.103075
ONE_PRODUCT_AGE       0.137720           NaN  -0.108078
MONTH_1_FLAG          0.092828     -0.007732  -0.069201
MONTH_2_FLAG          0.080453     -0.007400  -0.059549
VDR                   0.079975      0.045110  -0.091243
MONTH_0_FLAG          0.074084     -0.008143  -0.054023
PRODUCT_AGE          -0.067619     -0.046682   0.082363
TENURE               -0.061022     -0.046162   0.076792
CH                    0.055896      0.042566  -0.070513
V_OTHER              -0.055784     -0.043773   0.071161
PAY_DEBIT            -0.050499     -0.040004   0.064650
1_SPONSORSHIP        -0.048125     -0.031288   0.057436
2_SPONSORSHIP         0.040916      0.032428  -0.052391
CH_OTHER             -0.036099     -0.032116   0.048365
FIRST_PRODUCT_FLAG    0.034723      0.032542  -0.047529
INVOL_POSS_ONE__AGE   0.026018      0.027321  -0.037408
CH_TELEMARKETING     -0.023676     -0.015571   0.028366
PAY_CREDIT           -0.022139      0.040396        NaN
MONTHLY_FREQ          0.020086      0.011782  -0.023192
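A ranking like the one above can be sketched as follows. This is a minimal illustration, not the exact code used for the table: the helper `top_correlated_features`, the DataFrame `toy`, and its `NOISE` column are hypothetical, and the sketch assumes the target is stored as one-hot columns named like the table header. The table appears to be ordered by the absolute correlation with VOL_ACTUAL, so the sketch sorts the same way.

```python
import numpy as np
import pandas as pd


def top_correlated_features(df, target_cols, sort_by, n=20):
    """Return the n features with the largest absolute Pearson
    correlation with the `sort_by` target column (hypothetical helper)."""
    feature_cols = [c for c in df.columns if c not in target_cols]
    corr = df.corr()  # pairwise Pearson correlation of all columns
    table = corr.loc[feature_cols, target_cols]
    order = table[sort_by].abs().sort_values(ascending=False).index
    return table.loc[order].head(n)


# Toy data for illustration only (not the membership dataset).
rng = np.random.default_rng(0)
n_rows = 1000
toy = pd.DataFrame({
    "TENURE": rng.normal(size=n_rows),
    "NOISE": rng.normal(size=n_rows),
})
toy["CONTINUE"] = (toy["TENURE"] + rng.normal(scale=0.5, size=n_rows) > 0).astype(int)
toy["VOL_ACTUAL"] = 1 - toy["CONTINUE"]

ranking = top_correlated_features(
    toy, ["VOL_ACTUAL", "CONTINUE"], sort_by="VOL_ACTUAL", n=2
)
print(ranking)
```

Sorting by absolute value keeps strongly negative correlations (such as INVOL_POSSIBLE in the table) near the top instead of pushing them to the bottom.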
The code that I ran for SMOTE is below:
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
# scale the data
scaler = StandardScaler()
X_train_scale = scaler.fit_transform(X_train)
X_test_scale = scaler.transform(X_test)
X_dev_scale = scaler.transform(X_dev)
# SMOTE sample
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scale, y_train)
After SMOTE, logistic regression shows poor precision on the minority classes (1.0 = Voluntary Discontinue; 2.0 = Involuntary Discontinue):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
lg_model = LogisticRegression(max_iter=1000)
lg_model = lg_model.fit(X_train_resampled, y_train_resampled)
y_pred = lg_model.predict(X_dev_scale)
print('LG with resampling')
print(classification_report(y_dev, y_pred))
LG with resampling
              precision    recall  f1-score   support

         0.0       1.00      0.59      0.74   1100105
         1.0       0.07      0.65      0.12     10847
         2.0       0.01      0.73      0.03      6361

    accuracy                           0.59   1117313
   macro avg       0.36      0.66      0.30   1117313
weighted avg       0.98      0.59      0.73   1117313
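To see what the low precision means in counts, the report can be inverted with a little arithmetic. This is a rough back-of-the-envelope sketch: the helper `implied_counts` is hypothetical, and because the report rounds precision and recall to two decimals the results are only approximate.

```python
def implied_counts(support, recall, precision):
    """Back out approximate counts from one row of a classification report.

    recall    = TP / support
    precision = TP / (TP + FP)
    """
    true_positives = support * recall
    predicted_positives = true_positives / precision
    false_positives = predicted_positives - true_positives
    return true_positives, predicted_positives, false_positives


# Row for class 1.0 (Voluntary Discontinue) in the report above.
tp, predicted, fp = implied_counts(support=10847, recall=0.65, precision=0.07)
print(f"true positives      ~ {tp:,.0f}")
print(f"predicted as 1.0    ~ {predicted:,.0f}")
print(f"false positives     ~ {fp:,.0f}")
```

With roughly 1.1 million class 0.0 rows in the dev set, even a modest share of them being predicted as class 1.0 outnumbers the roughly 7,000 correct predictions, which is what a precision of 0.07 reflects.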
I also tried undersampling, but the results are almost the same:
from imblearn.under_sampling import RandomUnderSampler
# Initialize the RandomUnderSampler
under_sampler = RandomUnderSampler(random_state=42)
# Resample the dataset
X_train_undersampled, y_train_undersampled = under_sampler.fit_resample(X_train_scale, y_train)
lg_model = LogisticRegression(max_iter=1000)
lg_model = lg_model.fit(X_train_undersampled, y_train_undersampled)
y_pred = lg_model.predict(X_dev_scale)
print('LG with undersampling')
print(classification_report(y_dev, y_pred))
LG with undersampling
              precision    recall  f1-score   support

         0.0       1.00      0.59      0.74   1100105
         1.0       0.06      0.65      0.12     10847
         2.0       0.01      0.73      0.03      6361

    accuracy                           0.59   1117313
   macro avg       0.36      0.66      0.29   1117313
weighted avg       0.98      0.59      0.73   1117313
Decision Tree result:
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier()
dt_model = dt_model.fit(X_train_undersampled, y_train_undersampled)
y_pred = dt_model.predict(X_dev_scale)
print(classification_report(y_dev, y_pred))
              precision    recall  f1-score   support

         0.0       0.99      0.50      0.66   1100105
         1.0       0.04      0.66      0.08     10847
         2.0       0.01      0.71      0.02      6361

    accuracy                           0.50   1117313
   macro avg       0.35      0.62      0.25   1117313
weighted avg       0.98      0.50      0.65   1117313
Regarding your recommendation on cross-validation, it did not run to completion because of the large dataset and memory limits:
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
model_mean_accuracy=[]
model_std=[]
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
classifiers=['Logistic Regression','Decision Tree']
models=[LogisticRegression(max_iter=1000), dt_model]
for i, model in zip(classifiers, models):
    cv_result = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    cv_result = cv_result
    model_mean_accuracy.append(cv_result.mean())
    model_std.append(cv_result.std())
# Print results
for i, classifier in enumerate(classifiers):
    print(f"{classifier}: Mean Accuracy = {model_mean_accuracy[i]}, Std = {model_std[i]}")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[66], line 7
4 models=[LogisticRegression(max_iter=1000), dt_model]
6 for i, model in zip(classifiers, models):
----> 7 cv_result = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
8 cv_result = cv_result
9 model_mean_accuracy.append(cv_result.mean())
File ~\AppData\Roaming\Python\Python39\site-packages\sklearn\utils\_param_validation.py:213, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
207 try:
208 with config_context(
209 skip_parameter_validation=(
210 prefer_skip_nested_validation or global_skip_validation
211 )
212 ):
--> 213 return func(*args, **kwargs)
214 except InvalidParameterError as e:
215 # When the function is just a wrapper around an estimator, we allow
216 # the function to delegate validation to the estimator, but we replace
217 # the name of the estimator by the name of the function in the error
218 # message to avoid confusion.
219 msg = re.sub(
220 r"parameter of \w+ must be",
221 f"parameter of {func.__qualname__} must be",
...
array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
File "C:\Users\AnChan\AppData\Roaming\Python\Python39\site-packages\sklearn\utils\_array_api.py", line 521, in _asarray_with_order
array = numpy.asarray(array, order=order, dtype=dtype)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 909. MiB for an array with shape (5959005, 20) and data type float64
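The size in the error message matches a single dense float64 copy of the full feature matrix, which can be checked with simple arithmetic. The sketch below only reproduces that calculation; the helper `array_mib` is hypothetical, and whether a smaller dtype or fewer rows would actually avoid the error on this machine is an assumption that would need testing.

```python
def array_mib(rows, cols, bytes_per_value):
    """Size in MiB of a dense rows x cols array."""
    return rows * cols * bytes_per_value / 2**20


rows, cols = 5_959_005, 20  # shape reported in the traceback

mib_float64 = array_mib(rows, cols, 8)  # float64 uses 8 bytes per value
mib_float32 = array_mib(rows, cols, 4)  # float32 uses 4 bytes per value

# With 5-fold cross-validation each training split holds about 4/5 of the rows.
train_rows_per_fold = rows * 4 // 5
mib_fold_float64 = array_mib(train_rows_per_fold, cols, 8)

print(f"full matrix, float64: {mib_float64:.1f} MiB")
print(f"full matrix, float32: {mib_float32:.1f} MiB")
print(f"one training fold, float64: {mib_fold_float64:.1f} MiB")
```

The arithmetic suggests two things to try: storing `X` as float32, or running cross-validation on a stratified subsample of the rows. Each fold also creates its own copy of the training split, so peak memory is higher than the size of `X` alone.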
I am not sure whether this is the right approach. I suspect the variables themselves may not be good enough for prediction, since they all show quite low correlation with the y label. That would explain why, whether I use different sampling methods to address the class imbalance or try other algorithms, the models still perform poorly at classifying the minority classes.
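One caveat on this reasoning: Pearson correlation measures only linear association, so a low value does not by itself show that a feature carries no signal. The toy example below uses made-up numbers, not the membership data, and the helper `pearson` is written out by hand for illustration. The label depends entirely on the feature, yet the correlation with the raw feature is zero.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical feature: the label is 1 whenever the value is far from zero.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [1 if abs(x) >= 2 else 0 for x in xs]

r_raw = pearson(xs, ys)                    # symmetric relationship cancels out
r_abs = pearson([abs(x) for x in xs], ys)  # a simple transform exposes the signal

print(f"correlation with x:   {r_raw:.3f}")
print(f"correlation with |x|: {r_abs:.3f}")
```

A tree-based model could split on such a feature even though its linear correlation with the label is zero, so the correlation table alone may understate or overstate what the features can do.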
Looking forward to your further guidance. Thanks a lot!