Huge Imbalance Dataset Classification Questions

Hi all,
I am trying to predict the number of members who will discontinue their membership. The whole dataset is about 12 million rows with about 40 columns. A member’s status can be “Continue”, “Voluntary Discontinue” or “Involuntary Discontinue”. The dataset is highly imbalanced: 98% of members are “Continue”, and about 1% each are “Voluntary Discontinue” and “Involuntary Discontinue”. To reduce dimensionality, I ran a correlation analysis and selected only the 15 features with the highest correlation for modelling.

Below are the problems I am facing:

  1. My colleague used multinomial regression. However, he did not apply a threshold to convert the probabilities into class labels. Instead, he summed the predicted probabilities of all individual members to estimate the number of members who will Voluntary Discontinue or Involuntary Discontinue.
    I am not sure about this approach because I don’t quite get what the sum of individual probabilities means compared with using a threshold. Is this approach correct, given that we are interested in the total number of people? Also, how do we measure model performance with this approach?

  2. I am treating this as a classification problem. As the dataset is imbalanced and we are interested in the people who will discontinue, I used the SMOTE resampling method. However, with logistic regression, a decision tree and a neural network, precision is still very low for the classes “Voluntary Discontinue” and “Involuntary Discontinue”. Are there any other ways to increase the precision of the minority classes?

  3. I tried to run random forest and XGBoost. However, they failed to run due to memory limitations. Any suggestions for tackling a dataset this large?


When you mention this, does the data also contain a column for attendance or usage in relation to membership?

Another approach to consider is to treat this not as a “classification” problem but as an “anomaly detection” problem.


No, the data does not have a column of attendance or usage. How is that related to the problem?

Thanks for your suggestion. Let me do some research on “anomaly detection”. But why do you think it is an anomaly detection problem? What is the benefit of treating it as anomaly detection instead of classification?


Precisely because the data is so unbalanced. That is exactly the situation you have in an anomaly detection problem: e.g. you have millions of charge card transactions and only a few hundred of them are fraudulent.

It’s been a long time since I learned about Anomaly Detection algorithms in Prof Ng’s original Stanford Machine Learning course, so I do not know what the current state of the art is for anomaly detection. The Stanford ML course reflected the SOTA as of late 2011 or early 2012, so it’s likely that there have been a few advances since then. :nerd_face: I think that topic is covered in the MLS specialization here, which is new as of a couple of years ago I think, but I have not taken MLS yet.
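For readers unfamiliar with the idea, here is a minimal sketch of density-based anomaly detection in the spirit of the course mentioned above: fit a simple Gaussian model to the majority (“Continue”) rows, then flag rows whose density falls below a cutoff. The synthetic data, the helper names `fit_gaussian` and `log_density`, and the 1st-percentile cutoff are all assumptions made for illustration and are not taken from the dataset in this thread.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate a per-feature mean and variance from 'normal' rows."""
    mu = X.mean(axis=0)
    var = X.var(axis=0) + 1e-9  # small constant avoids division by zero
    return mu, var

def log_density(X, mu, var):
    """Log of the product of independent per-feature Gaussian densities."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)

rng = np.random.default_rng(0)
normal_rows = rng.normal(0.0, 1.0, size=(5000, 3))  # stand-in for "Continue"
odd_rows = rng.normal(6.0, 1.0, size=(20, 3))       # stand-in for rare cases

mu, var = fit_gaussian(normal_rows)
# Cutoff (assumed): the 1st percentile of the density on the normal rows
epsilon = np.percentile(log_density(normal_rows, mu, var), 1)
flags = log_density(odd_rows, mu, var) < epsilon
print(f"flagged {flags.sum()} of {len(odd_rows)} unusual rows")
```

In practice the cutoff would be tuned on a labelled validation set, and this only works well if the rare rows really do look different from the majority on the chosen features.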


The reason I was asking is that you have not provided much information about the columns and features you used to classify your data.

You also mentioned using SMOTE resampling method, but you didn’t mention how you did the resampling.

From the information given, you want to identify who will discontinue their membership, and voluntary and involuntary discontinues together account for only 2% of the dataset, which makes them the minority classes if you plan to do a classification analysis.

One approach would be to under-sample the majority class and check whether the model changes and whether it can detect a pattern among the members. (I did not advise over-sampling the minority classes, as that could create bias/variance issues given the huge imbalance in the dataset.)

Most people who get an imbalanced dataset first calculate the F1 score for their model.

You could also try dividing the majority class equally between the cross-validation and test datasets, and include the 2% minority-class data in both.

You could use the weights argument of the classification function to penalise the algorithm more heavily for misclassifying the rare minority classes.

Alternatively, use the cost argument of the classification algorithm, but you would need to set a high cost for misclassifying the rare class.
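As an illustration of the weighting idea, the sketch below computes one weight per class that is inversely proportional to the class frequency. The formula `n_samples / (n_classes * class_count)` is the one scikit-learn documents for `class_weight="balanced"`; the helper name `balanced_class_weights` and the toy labels are assumptions for this example.

```python
import numpy as np

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Toy labels with roughly the 98% / 1% / 1% split from the question
y = np.array([0] * 980 + [1] * 10 + [2] * 10)
weights = balanced_class_weights(y)
print(weights)
```

Many scikit-learn classifiers, including `LogisticRegression` and `DecisionTreeClassifier`, accept such a dictionary (or the string `"balanced"`) through their `class_weight` parameter, which avoids creating any resampled copy of the data.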

No matter what, with a highly imbalanced dataset you can always try two approaches, compare them, and see which does better. Since you mention SMOTE is not doing well, you could try logistic regression, K-nearest neighbours classification, a support vector machine and a decision tree classifier.

Evaluation metrics to use: precision, recall, F1-score, area under the ROC curve and confusion matrices.
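To make these metrics concrete, here is a small sketch that builds a confusion matrix and derives per-class precision, recall and F1 from it. The function name `per_class_metrics` and the ten toy labels are assumptions for illustration; in practice `sklearn.metrics.classification_report` reports the same quantities.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, n_classes):
    """Return the confusion matrix and per-class precision, recall, F1."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)  # rows: actual, columns: predicted
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)          # column totals
    actual = cm.sum(axis=1)             # row totals
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return cm, precision, recall, f1

# Toy labels: class 0 is the majority, classes 1 and 2 are rare
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 2, 1, 0, 2, 2])
cm, precision, recall, f1 = per_class_metrics(y_true, y_pred, n_classes=3)
print(cm)
print("precision:", precision.round(2), "recall:", recall.round(2), "f1:", f1.round(2))
```

Precision for a rare class divides by everything predicted as that class, so even a modest share of majority rows misclassified into it pulls precision down sharply, while recall stays unaffected.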

If you provide more information on your columns and features, a better statistical approach can be decided.


Hello @ansonchantf,

What’s the time span of this dataset? How did the “discontinued” rate change over time (if possible, sharing a graph would be nice)? From your exploratory analysis with the original 40 features before dimensionality reduction, anything special about those discontinued members?


Thank you @Deepti_Prasad. I appreciate your suggestions a lot! I did not say much about the features because I thought the meaning of each feature did not matter too much. As long as a feature shows a higher correlation with the membership status, shouldn’t we put it into the model?

I ran a correlation analysis and selected the 20 features with the highest correlation for modelling.

                              VOL_ACTUAL  INVOL_ACTUAL  CONTINUE
OTHER_PAY                       0.150662           NaN -0.118051
INVOL_POSSIBLE                 -0.142355      0.016845  0.103075
ONE_PRODUCT_AGE                 0.137720           NaN -0.108078
MONTH_1_FLAG                    0.092828     -0.007732 -0.069201
MONTH_2_FLAG                    0.080453     -0.007400 -0.059549
VDR                             0.079975      0.045110 -0.091243
MONTH_0_FLAG                    0.074084     -0.008143 -0.054023
PRODUCT_AGE                    -0.067619     -0.046682  0.082363
TENURE                         -0.061022     -0.046162  0.076792
CH                              0.055896      0.042566 -0.070513
V_OTHER                        -0.055784     -0.043773  0.071161
PAY_DEBIT                      -0.050499     -0.040004  0.064650
1_SPONSORSHIP                  -0.048125     -0.031288  0.057436
2_SPONSORSHIP                   0.040916      0.032428 -0.052391
CH_OTHER                       -0.036099     -0.032116  0.048365
FIRST_PRODUCT_FLAG              0.034723      0.032542 -0.047529
INVOL_POSS_ONE__AGE             0.026018      0.027321 -0.037408
CH_TELEMARKETING               -0.023676     -0.015571  0.028366
PAY_CREDIT                     -0.022139      0.040396       NaN
MONTHLY_FREQ                    0.020086      0.011782 -0.023192

The code that I ran for SMOTE is below:

from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE

# scale the data
scaler = StandardScaler()
X_train_scale = scaler.fit_transform(X_train)
X_test_scale = scaler.transform(X_test)
X_dev_scale = scaler.transform(X_dev)

# SMOTE sample
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scale, y_train)
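To show roughly what the resampling step does, here is a simplified sketch of the interpolation idea behind SMOTE: each synthetic row lies on the line between a minority row and one of its nearest minority neighbours. The function `smote_like` and the toy rows are assumptions for illustration; this is not imbalanced-learn’s implementation.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Interpolate between a minority row and one of its k nearest neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = X_min[rng.integers(len(X_min))]
        dist = np.linalg.norm(X_min - base, axis=1)
        neighbours = np.argsort(dist)[1:k + 1]  # position 0 is the row itself
        other = X_min[rng.choice(neighbours)]
        synthetic[i] = base + rng.random() * (other - base)
    return synthetic

# Hypothetical minority rows: one continuous feature and one 0/1 flag
X_min = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0],
                  [4.0, 1.0], [5.0, 1.0], [6.0, 0.0]])
new_rows = smote_like(X_min, n_new=50, k=3)
fractional_flags = np.sum((new_rows[:, 1] > 0) & (new_rows[:, 1] < 1))
print(f"{fractional_flags} of {len(new_rows)} synthetic rows have a flag strictly between 0 and 1")
```

One consequence worth checking: interpolation produces in-between values for two-valued flag columns, which never occur in real rows. imbalanced-learn provides `SMOTENC` for data that mixes continuous and categorical features.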

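The result of logistic regression after SMOTE shows poor precision on the minority classes (1.0 = Voluntary Discontinue; 2.0 = Involuntary Discontinue):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# fit on the SMOTE-resampled training set, evaluate on the untouched dev set
lg_model = LogisticRegression(max_iter=1000)
lg_model.fit(X_train_resampled, y_train_resampled)
y_pred = lg_model.predict(X_dev_scale)
print('LG with resampling')
print(classification_report(y_dev, y_pred))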

LG with resampling
              precision    recall  f1-score   support

         0.0       1.00      0.59      0.74   1100105
         1.0       0.07      0.65      0.12     10847
         2.0       0.01      0.73      0.03      6361

    accuracy                           0.59   1117313
   macro avg       0.36      0.66      0.30   1117313
weighted avg       0.98      0.59      0.73   1117313

I also tried undersampling, but the result is not much different:

from imblearn.under_sampling import RandomUnderSampler

# Initialize the RandomUnderSampler
under_sampler = RandomUnderSampler(random_state=42)

# Resample the dataset
X_train_undersampled, y_train_undersampled = under_sampler.fit_resample(X_train_scale, y_train)

lg_model = LogisticRegression(max_iter=1000)
lg_model.fit(X_train_undersampled, y_train_undersampled)
y_pred = lg_model.predict(X_dev_scale)
print('LG with undersampling')
print(classification_report(y_dev, y_pred))

LG with undersampling
              precision    recall  f1-score   support

         0.0       1.00      0.59      0.74   1100105
         1.0       0.06      0.65      0.12     10847
         2.0       0.01      0.73      0.03      6361

    accuracy                           0.59   1117313
   macro avg       0.36      0.66      0.29   1117313
weighted avg       0.98      0.59      0.73   1117313

Decision Tree result:

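from sklearn.tree import DecisionTreeClassifier

# fit on the undersampled training set, evaluate on the untouched dev set
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train_undersampled, y_train_undersampled)
y_pred = dt_model.predict(X_dev_scale)
print(classification_report(y_dev, y_pred))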
              precision    recall  f1-score   support

         0.0       0.99      0.50      0.66   1100105
         1.0       0.04      0.66      0.08     10847
         2.0       0.01      0.71      0.02      6361

    accuracy                           0.50   1117313
   macro avg       0.35      0.62      0.25   1117313
weighted avg       0.98      0.50      0.65   1117313

For your recommendation on cross-validation, it did not go through because of the large dataset and memory limitations.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

classifiers = ['Logistic Regression', 'Decision Tree']
models = [LogisticRegression(max_iter=1000), dt_model]

# collect the mean and standard deviation of the fold accuracies per model
model_mean_accuracy = []
model_std = []
for i, model in zip(classifiers, models):
    cv_result = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    model_mean_accuracy.append(cv_result.mean())
    model_std.append(cv_result.std())

# Print results
for i, classifier in enumerate(classifiers):
    print(f"{classifier}: Mean Accuracy = {model_mean_accuracy[i]}, Std = {model_std[i]}")

ValueError                                Traceback (most recent call last)
Cell In[66], line 7
      4 models=[LogisticRegression(max_iter=1000), dt_model]
      6 for i, model in zip(classifiers, models):
----> 7     cv_result = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
      8     cv_result = cv_result
      9     model_mean_accuracy.append(cv_result.mean())

File ~\AppData\Roaming\Python\Python39\site-packages\sklearn\utils\, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    207 try:
    208     with config_context(
    209         skip_parameter_validation=(
    210             prefer_skip_nested_validation or global_skip_validation
    211         )
    212     ):
--> 213         return func(*args, **kwargs)
    214 except InvalidParameterError as e:
    215     # When the function is just a wrapper around an estimator, we allow
    216     # the function to delegate validation to the estimator, but we replace
    217     # the name of the estimator by the name of the function in the error
    218     # message to avoid confusion.
    219     msg = re.sub(
    220         r"parameter of \w+ must be",
    221         f"parameter of {func.__qualname__} must be",
    array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
  File "C:\Users\AnChan\AppData\Roaming\Python\Python39\site-packages\sklearn\utils\", line 521, in _asarray_with_order
    array = numpy.asarray(array, order=order, dtype=dtype)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 909. MiB for an array with shape (5959005, 20) and data type float64
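The allocation size in the error above can be checked by hand: a dense float64 array takes 8 bytes per value, so 5,959,005 rows by 20 columns comes to about 909 MiB. The sketch below reproduces that figure and shows the effect of downcasting to float32. The helper `mib` and the small stand-in array are assumptions for illustration.

```python
import numpy as np

rows, cols = 5_959_005, 20  # shape reported in the error message

def mib(n_rows, n_cols, dtype):
    """Size in MiB of a dense array with the given shape and dtype."""
    return n_rows * n_cols * np.dtype(dtype).itemsize / 2**20

print(f"float64: {mib(rows, cols, np.float64):.1f} MiB")
print(f"float32: {mib(rows, cols, np.float32):.1f} MiB")

# Downcasting a small stand-in array halves its footprint
X = np.ones((1000, cols), dtype=np.float64)
X32 = X.astype(np.float32)
print(X.nbytes, X32.nbytes)
```

Whether this helps in practice depends on the estimator, since some may convert the input back to float64 internally. Cross-validation also holds several such copies at once (one training fold per split), so reducing the number of rows is likely to matter more than the dtype alone.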

I am not sure if this is the right approach. I feel the variables themselves may not be good enough for prediction, as they all show quite low correlation with the y label. Therefore, no matter which sampling method I use to address the imbalance or which algorithm I try, the performance on the minority classes is still poor.

Looking forward to your further guidance. Thanks a lot!

Were there no missing data?

I don’t see you checking if any null values were present.

Can you give me a list of all columns?

Basically, your model is failing mainly because the dataset is too large.

You could select a subset of the data by time frame (months/years) to reduce it, then run the analysis with 20 columns at a time and see if you notice any pattern.

Also, I noticed NaN in your continuous columns, so make sure you check the 12-million-row dataset for datatype differences, empty values and object-type columns. Make sure the columns you select have the same datatype. You also didn’t mention how you selected the top 20 columns. Go through the column names; half your answer will be there, as they hold significance for your analysis. You still have not said what kind of membership the data is about (even that will give you an idea of which columns to choose).

I feel you will find a solution if you keep exploring the dataset a little more; the approach just needs to be more versatile but selective.

I will reply more precisely once I get a response to this query on the column details, why you chose random state=42, and what membership this is about. As I see that even the continuous data has negative values and NaN, I highly suspect it is credit card membership data :crazy_face:



Thank you @Deepti_Prasad for your quick response. I will try to provide more context here for you and @rmwkwok and hope this can help clarify!

It is a charity donations dataset. It originally contains monthly transactional data from 2019 to 2023, with 12 million rows and 72 columns. To speed up exploratory analysis, I only imported data from 2020 to 2023 (still 7.5 million rows) for this predictive modelling project.

Based on my colleagues’ domain knowledge, 46 possible X variables were selected for further EDA, and honestly I don’t understand the meaning of each column. I did not list all the column names because I am not able to explain what they mean and which could be more relevant to the modelling. I list them here for your reference anyway.

       'CH_INT', 'CH_TELE', 'CH_OTHER', 'VDR',
       'V_2', 'V', 'V_1',
       'VDR_3', 'VDR_4', 'V_OTHER', 'PRODUCT_AGE',
       'CNT_P12M', 'AMT_P12M', 'LETTERS', 'RT_INCREASE',

To reduce dimensionality, I ran a correlation analysis and included the top 20 variables in the model based on correlation alone. After that, I cleaned the data by dropping rows with missing values before modelling. Therefore, the correlation analysis still includes missing values. I simply dropped them because less than 1% of the data is missing, and for a quick exploration I did not spend much time on imputation.

The data types of the selected top 20 variables, for your reference:

OTHER_PAY                       float64
INVOL_POSSIBLE                     int8
ONE_PRODUCT_AGE                 float64
MONTH_1_FLAG                       int8
MONTH_2_FLAG                       int8
VDR                                int8
MONTH_0_FLAG                       int8
PRODUCT_AGE                       int16
TENURE                          float64
CH_FACE                            int8
V_OTHER                            int8
PAY_DEBIT                       float64
1_SPONSORSHIP                      int8
2_SPONSORSHIP                      int8
CH_OTHER                           int8
FIRST_GIFT_FLAG                    int8
INVOL_POSS_ONE__AGE             float64
CH_TELEMARKETING                   int8
PAY_CREDIT                      float64
MONTHLY_FREQ                    float64
Y                               float64

Besides, random state=42 is just arbitrary. Does this matter?

I agree with you that variable selection is important. I will definitely spend more time understanding the columns, so that I can add any variables that make sense to include but did not show up in the correlation analysis.

Hope this clarifies.

@rmwkwok I just plotted the discontinue rate over time. I am not sure about the reason for the fluctuation in class label 2 (Involuntary Discontinue).

Appreciate your comments!

Okay! @ansonchantf, now we can have something to discuss.

First, let’s forget about the correlation-driven dimensionality reduction for now - the approach itself may be too sensitive to any systematic shift that can happen to your dataset and that sensitivity can make your model less robust.

What worries me more in your graph is the increasing trend in the blue line: it is a systematic shift. It means that whatever model best explains 2021 and 2022 may fail to explain 2024, and, more importantly, we care about 2024 and 2025 more than we care about 2021 and 2022.

So, on the one hand, you said you wanted to reduce memory use, so focus on the last one or two years of data first. On the other hand, you would probably like to find out whether any of those 46 factors can explain the trend, and discuss your findings with the domain experts. After the discussion, see if you can build new features that explain the trend even better than the existing ones.
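A minimal sketch of both steps, assuming each row carries a period tag (here a year) and a class label: compute the discontinue rate per period to see the trend, then keep only the most recent periods. The function `rate_by_period` and the toy numbers are hypothetical and do not come from the real dataset.

```python
import numpy as np

def rate_by_period(period, label, target):
    """Share of rows with `label == target` within each distinct period."""
    periods = np.unique(period)
    rates = np.array([(label[period == p] == target).mean() for p in periods])
    return periods, rates

# Hypothetical toy data: 1000 rows per year, label 1 = voluntary discontinue
period = np.array([2021] * 1000 + [2022] * 1000 + [2023] * 1000)
label = np.zeros(3000, dtype=int)
label[:10] = 1        # 1.0% in 2021
label[1000:1015] = 1  # 1.5% in 2022
label[2000:2020] = 1  # 2.0% in 2023

periods, rates = rate_by_period(period, label, target=1)
for p, r in zip(periods, rates):
    print(p, f"{r:.3%}")

# Keep only the most recent two years to cut memory use
recent = period >= 2022
print("rows kept:", recent.sum(), "of", len(period))
```

The same function can be applied to any candidate factor by filtering the rows first, which is one way to look for factors that move together with the trend.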

On not understanding the meaning of each column: I am sorry, but this is prohibited… I mean, you don’t need to explain them to us, but you need to understand each and every one of them by heart. Not understanding them is not an option. My experience is that, with an extensive EDA, you can even discover something that your domain experts were not aware of.

I mean, you can share whatever you like to, but you don’t have to explain them here. However, you need to understand them.

On whether random state=42 matters: yes, it can matter, but no, it does not matter now, and it shouldn’t matter much in the future either. Let’s not worry about that for now.

Agreed. If I were you, I would focus on just the data from the last 1, 1.5 or 2 years, depending on available memory, then redo the exploratory analysis and discuss your latest findings with your domain experts.



@ansonchantf, one suggestion for your EDA: based on discussion with domain experts, see if you can separate out subgroups that don’t have that increasing trend. It’s possible that one subgroup has a stable trend and another a decreasing trend, but they are just buried by the subgroup that has an increasing trend. Sub-grouping is meaningful when those groups have comparable sizes.
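The sub-grouping idea can be sketched as follows: fit a straight line to each subgroup’s rate over time and compare the slopes with the slope of the pooled rate. The subgroup names, the rates and the helper `trend_slope` are hypothetical, and the three groups are assumed to be of equal size so that the pooled rate is a plain average.

```python
import numpy as np

def trend_slope(periods, rates):
    """Least-squares slope of rate against period (change per period)."""
    x = periods - periods.min()  # shift to start at 0 for numerical stability
    slope, _intercept = np.polyfit(x, rates, deg=1)
    return slope

years = np.array([2020, 2021, 2022, 2023])
# Hypothetical discontinue rates for three subgroups of similar size
subgroups = {
    "A": np.array([0.010, 0.010, 0.010, 0.010]),  # stable
    "B": np.array([0.012, 0.011, 0.010, 0.009]),  # decreasing
    "C": np.array([0.008, 0.012, 0.016, 0.020]),  # increasing
}
slopes = {name: trend_slope(years, r) for name, r in subgroups.items()}
overall = np.mean(list(subgroups.values()), axis=0)  # pooled rate, equal sizes
overall_slope = trend_slope(years, overall)

for name, s in slopes.items():
    print(name, f"{s:+.4f} per year")
print("overall", f"{overall_slope:+.4f} per year")
```

Here the pooled rate rises even though only one of the three subgroups is actually trending upward, which is the situation described above.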

PS: I didn’t suggest any next step for us here, but if you have something to share like your last post, we can discuss them.


How did you decide that the top-20 features were sufficient to capture the complexity of the data set?


@rmwkwok Thanks a lot for your detailed response. It is very helpful for my next step. There are a few things that may not be related to the project but confuse me:

  1. You mentioned systematic shift. Could you explain the concept a bit more? Is it the issue of non-stationarity? What is the best practice for addressing it? For example, housing prices in some markets show an increasing trend over time. This should be non-stationary and affect machine learning algorithms, e.g. linear regression? Should we avoid regression and use a tree-based model instead? Should we always check stationarity in a prediction task? (I am looking for a checklist/guideline for when I face a prediction project.)

  2. It sounds like reducing dimensionality based on correlation is not good enough. From my knowledge, we can use correlation and PLC to select variables. What is the best practice for choosing variables/reducing dimensionality apart from domain knowledge?

Hello @ansonchantf,

I have to leave now for a meeting, but I will get back to you later today. Leave more messages here if you have any. :wink:

They are good questions!



The number 20 is arbitrary again. What is a better way to decide the number of features?

I appreciate your time answering my questions! One more thing that confuses me a lot is the approach that my colleague used to estimate the number of members who discontinue.

  1. My colleague used multinomial regression. However, he did not apply a threshold to convert the probabilities into class labels. Instead, he summed the predicted probabilities of all individual members to estimate the number of members who will Voluntary Discontinue or Involuntary Discontinue.
    I am not sure about this approach because I don’t quite get what the sum of individual probabilities means compared with using a threshold. Is this approach correct, given that we are interested in the total number of people? Also, how do we measure model performance with this approach?

I would appreciate it a lot if you could explain further on this question! :raised_hands:

Given the state of the art in computing technology, there is rarely a good reason to eliminate any features.

Exceptions would be if you have a gigantic data set, or a very meager computer (either in memory or processing speed).

In general, first try using all of the features, see if you get a good fit, and only eliminate features if you can prove they are causing a bottleneck.


Hello @ansonchantf,

Let’s be clear about one thing: we are predicting who will discontinue, not how many of them will discontinue.

With the model, we only predict who will discontinue; based on that, we then produce a secondary statistic for how many of them will discontinue. Please don’t say that the model predicts how many will discontinue. Are we clear about the difference?

For the secondary statistic:

  1. If you sum the probabilities and divide by the number of probabilities, you get a mean probability.
  2. If you apply a threshold, count the positives and divide by the number of predictions, you get another mean probability.

The thing is, they are point estimates, and we don’t know how far off they are from the true mean. You can always compare these two probabilities with some samples to tell how accurate they are. For example, you calculate these two mean probabilities for May 2024, then in June you compare them with your records for May.
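The two secondary statistics can be compared on simulated data. The sketch below assumes the predicted probabilities are well calibrated (outcomes are drawn from those same probabilities), which is a strong assumption; the Beta distribution, the sample size and the 0.5 threshold are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
# Hypothetical per-member probabilities of discontinuing, mostly small
p = rng.beta(0.5, 30.0, size=n)
# Simulated outcomes, assuming the probabilities are well calibrated
actual = rng.random(n) < p

expected_count = p.sum()            # statistic 1: sum of probabilities
threshold_count = (p >= 0.5).sum()  # statistic 2: threshold, then count

print(f"actual: {actual.sum()}, summed: {expected_count:.0f}, thresholded: {threshold_count}")
print(f"mean probability: {expected_count / n:.4f}, thresholded share: {threshold_count / n:.4f}")
```

Under calibration, the sum of probabilities is the expected number of discontinues, so it tracks the realised count, whereas a 0.5 threshold counts almost nobody when every individual probability is small. Note that resampling the training data (SMOTE, undersampling) changes the class balance the model sees, so probabilities from such a model would need recalibrating before being summed. Comparing the predicted count with the realised count period by period, as described above, is one way to measure how well this works.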

I don’t have a universal checklist.

Yes, we always need to check what’s going on with our data.

If the model predicted the number of discontinues, and that number is increasing, then you need to worry about your model being unable to extrapolate to an unseen level.

If the model predicts who will discontinue, and discontinuing is happening more often, you need to worry about whether the model is addressing the cause of the change. We will focus on just this case below.

There are two possible reasons you observe a trend:

  1. The underlying mechanism is changing
  2. The mechanism has not changed, but some factors become more active

Given that machine learning model training is to fit the model well to the training data, for 1, you don’t want any data representing the old norm to be used for training.

So, my suggestion for you to only use the recent years is precautionary. If you can find out the cause, you can make more informed decisions about what to do with the rest of the data. If you can find out the cause, you may change your modeling approach accordingly.
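One way to act on this precaution is to train only on a recent window and validate on the latest period, so that the evaluation reflects the current norm. A minimal sketch, assuming each row has a period tag; the function `recent_train_valid_masks` and the years used are hypothetical.

```python
import numpy as np

def recent_train_valid_masks(period, train_start, valid_start):
    """Train on [train_start, valid_start); validate on valid_start onward."""
    train_mask = (period >= train_start) & (period < valid_start)
    valid_mask = period >= valid_start
    return train_mask, valid_mask

# Hypothetical year tag for each row
period = np.array([2019, 2020, 2020, 2021, 2021, 2022, 2022, 2023, 2023])
train_mask, valid_mask = recent_train_valid_masks(period, train_start=2021, valid_start=2023)
print("train rows:", period[train_mask], "validation rows:", period[valid_mask])
```

Unlike a shuffled split, this never lets the model see rows from the validation period during training, and moving `train_start` earlier or later shows how much the older data helps or hurts.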

You can search for them on the Internet easily, and I don’t want to repeat them. However, I am not someone who believes that algorithmic dimensionality reduction is more reliable than making decisions based on domain knowledge and inspection.

From me, you won’t hear about how to automatically process your data without careful inspection :wink: