Huge Imbalance Dataset Classification Questions

Hello @ansonchantf

I don’t know your experience or expertise, but when you have a huge dataset like this (forget about balanced or imbalanced for a moment), doing data analysis first is more important.

To find which features are best suited for this classification model, you should have done a distribution analysis of the dataset. This can be done in different ways.

  1. One of the most common approaches is to use a statistical test for feature importance, such as the Chi-Square test. This test is applied when you have two categorical variables from a population, and it is used to determine whether there is a significant association between the two variables.
  2. Fisher’s exact test assesses the null hypothesis of independence by applying the hypergeometric distribution to the counts in the cells of the contingency table.
  3. A z-test is a statistical test used to determine whether two population means are different when the variances are known and the sample size is large.
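
As a rough sketch of these three tests in Python with SciPy: the contingency table counts and the two samples below are made up for illustration, and `two_sample_z_test` is a hypothetical helper (not a SciPy function) that assumes the population variances are known.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact, norm

# Hypothetical 2x2 contingency table (made-up counts):
# rows = feature value (A / B), columns = class label (negative / positive).
table = np.array([[90, 10],
                  [60, 40]])

# 1. Chi-square test of independence between the feature and the label.
chi2_stat, chi2_p, dof, expected = chi2_contingency(table)

# 2. Fisher's exact test on the same 2x2 table.
odds_ratio, fisher_p = fisher_exact(table)


def two_sample_z_test(x, y, var_x, var_y):
    """Two-sided z-test for a difference in means, variances assumed known."""
    z = (np.mean(x) - np.mean(y)) / np.sqrt(var_x / len(x) + var_y / len(y))
    p = 2 * norm.sf(abs(z))
    return z, p


# 3. z-test on two large synthetic samples with known variance 1.0.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=500)
y = rng.normal(loc=0.5, scale=1.0, size=500)
z_stat, z_p = two_sample_z_test(x, y, var_x=1.0, var_y=1.0)

print(f"chi-square p={chi2_p:.4g}, Fisher p={fisher_p:.4g}, z-test p={z_p:.4g}")
```

A small p-value in any of these suggests the feature and the label are not independent (or, for the z-test, that the two means differ).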

I would have chosen to do all three for comparison. But the Chi-square statistic is likely to tell you the most about your data, as it is a non-parametric (distribution-free) tool designed to analyse group differences when the dependent variable is measured at a nominal level, which is the case for a class label. Like all non-parametric statistics, the Chi-square is robust with respect to the distribution of the data.

Once these statistical tests tell you which features hold value in your dataset, you can plan your next step: what kind of model you want to create. They also give you an idea of whether a classification model is possible to create at all.
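
One possible way to screen many categorical columns against the target is to run a Chi-square test per column and rank the columns by p-value. This is only a sketch on synthetic data: the column names `signal` and `noise`, the roughly 5% positive rate, and the helper `chi_square_screen` are all assumptions for illustration, not something from your dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)
n = 5000

# Synthetic, imbalanced binary target (about 5% positives); purely illustrative.
target = rng.random(n) < 0.05

df = pd.DataFrame({
    "target": target.astype(int),
    # "signal" is drawn differently for positives and negatives,
    # "noise" is drawn the same way regardless of the target.
    "signal": np.where(target,
                       rng.choice(["a", "b"], n, p=[0.8, 0.2]),
                       rng.choice(["a", "b"], n, p=[0.3, 0.7])),
    "noise": rng.choice(["x", "y", "z"], n),
})


def chi_square_screen(frame, target_col):
    """Chi-square test of each column against the target, sorted by p-value."""
    rows = []
    for col in frame.columns.drop(target_col):
        contingency = pd.crosstab(frame[col], frame[target_col])
        stat, p_value, dof, _ = chi2_contingency(contingency)
        rows.append({"feature": col, "chi2": stat, "dof": dof, "p_value": p_value})
    return pd.DataFrame(rows).sort_values("p_value").reset_index(drop=True)


result = chi_square_screen(df, "target")
print(result)
```

Columns near the top of the table are the ones most clearly associated with the target. With a very large dataset even weak associations can get tiny p-values, so it is worth looking at the size of the `chi2` statistic as well.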


Hello @ansonchantf

ok great so you do have too many columns of data to analyse :slight_smile:

I cannot judge your colleague’s choice of columns as I don’t know his criteria. But leaving that aside, even if you have little domain expertise here, running statistical tests on the data would have helped you find out which columns hold more value, be it 40 or 70. The only thing you need to understand is the usual rule of thumb: if the sample size (the number of observations, not the number of features or columns) is less than 30, you do a t-test; if it is more than 30, you do a z-test (information about this is in my other comment response, so kindly go through it).
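
To see how the t-test and the z-test relate, here is a minimal sketch that runs both on the same two groups, once with few observations and once with many. The data is synthetic, and `z_test_sample_var` is a hypothetical helper that plugs in sample variances (an approximation, since a strict z-test assumes known variances).

```python
import numpy as np
from scipy.stats import norm, ttest_ind


def z_test_sample_var(x, y):
    """Two-sided z-test for a difference in means using sample variances."""
    se = np.sqrt(np.var(x, ddof=1) / len(x) + np.var(y, ddof=1) / len(y))
    z = (np.mean(x) - np.mean(y)) / se
    return z, 2 * norm.sf(abs(z))


rng = np.random.default_rng(1)
results = {}
for n in (10, 2000):
    # Two synthetic groups whose true means differ by 0.3.
    x = rng.normal(0.0, 1.0, n)
    y = rng.normal(0.3, 1.0, n)
    # Welch's t-test (does not assume equal variances).
    t_stat, t_p = ttest_ind(x, y, equal_var=False)
    z_stat, z_p = z_test_sample_var(x, y)
    results[n] = (t_p, z_p)
    print(f"n={n}: t-test p={t_p:.4g}, z-test p={z_p:.4g}")
```

With few observations the t-test is the more conservative of the two, and with many observations the two p-values become practically the same.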

Actually, you could even start with simple null hypothesis testing and remove the columns that show no significant association with your target from the selection for your data modelling, and then move on to more detailed statistical testing like the Chi-square, z-test or Fisher test.

Do this testing first. Then, once you have the variables that are related to your target, do your correlation analysis only on those variables/columns/features, and you will get a better result.
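
The "test first, then correlate" order could look roughly like this. It is a sketch under assumptions: the features `f1` to `f4` are synthetic numeric columns, the filter is a Welch t-test between the two classes, and the 0.05 threshold is just a conventional choice.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
n = 2000

# Synthetic imbalanced target (about 10% positives); purely illustrative.
target = (rng.random(n) < 0.10).astype(int)

f1 = rng.normal(size=n) + 1.0 * target   # shifted for the positive class
f2 = f1 + 0.5 * rng.normal(size=n)       # noisy copy of f1
df = pd.DataFrame({
    "target": target,
    "f1": f1,
    "f2": f2,
    "f3": rng.normal(size=n),            # unrelated to the target
    "f4": rng.normal(size=n),            # unrelated to the target
})

# Step 1: keep only features whose class means differ significantly.
alpha = 0.05
selected = []
for col in ["f1", "f2", "f3", "f4"]:
    pos = df.loc[df["target"] == 1, col]
    neg = df.loc[df["target"] == 0, col]
    _, p_value = ttest_ind(pos, neg, equal_var=False)
    if p_value < alpha:
        selected.append(col)

# Step 2: correlation analysis only on the retained features.
corr = df[selected].corr()
print(selected)
print(corr.round(2))
```

Features that pass the test but are strongly correlated with each other (like `f1` and `f2` here) carry overlapping information, so you may not need both in the model.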

Honestly, you only have 4 years of data, which is not much data by time-series standards. But your challenge here is more about column selection, so focus on that: understand the data and make the right selection after your data analysis.

I don’t know if you know SAS programming, but I feel statistical analysis on that platform is better than in Python for comparative analysis.


Thanks a lot for your advice! This is more than enough for my project and learning. I will go back to study and revise. Appreciate that!

@Deepti_Prasad @rmwkwok @paulinpaloalto @TMosh :raised_hands: