Hello @ansonchantf
I don’t know your experience or expertise, but when you have a huge dataset like this (balanced or imbalanced), data analysis should come first.
To find which features are best suited to this classification model, you should start with an analysis of the dataset's distribution. This can be done in different ways.
- One of the most common approaches is to use a statistical test for feature importance, for example the Chi-square test. It is applied when you have two categorical variables from a population, and it determines whether there is a significant association between the two variables.
- Fisher’s exact test assesses the null hypothesis of independence using the hypergeometric distribution of the cell counts in the contingency table. It is typically preferred over Chi-square when the cell counts are small.
- A z-test is a statistical test used to determine whether two population means are different when the variances are known and the sample size is large.
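
To make the three tests above concrete, here is a minimal sketch using SciPy. The 2x2 table and the numbers passed to the z-test are made up for illustration, and `two_sample_z_test` is a hypothetical helper written here by hand (it is not a SciPy function); it assumes the population variances are known.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact, norm

# Made-up 2x2 contingency table:
# rows = feature value (A, B), columns = class label (0, 1)
table = np.array([[30, 10],
                  [15, 25]])

# Chi-square test of independence (SciPy applies Yates' correction for 2x2 tables by default)
chi2, p_chi2, dof, expected = chi2_contingency(table)

# Fisher's exact test on the same 2x2 table
odds_ratio, p_fisher = fisher_exact(table)

def two_sample_z_test(mean1, mean2, var1, var2, n1, n2):
    """Two-sided z-test for a difference in means, assuming known variances."""
    z = (mean1 - mean2) / np.sqrt(var1 / n1 + var2 / n2)
    p = 2 * norm.sf(abs(z))  # two-sided p-value from the standard normal
    return z, p

# Made-up summary statistics for one numeric feature in each class
z, p_z = two_sample_z_test(5.2, 4.9, 1.0, 1.2, 500, 450)

print(f"chi-square: chi2={chi2:.3f}, dof={dof}, p={p_chi2:.4f}")
print(f"fisher:     odds ratio={odds_ratio:.3f}, p={p_fisher:.4f}")
print(f"z-test:     z={z:.3f}, p={p_z:.4f}")
```

Note that Chi-square and Fisher's test work on categorical features, while the z-test compares means of a numeric feature, so which one applies depends on the column type.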
I would run all three for comparison. However, the Chi-square statistic is likely to be the most informative for your data, since it is a non-parametric (distribution-free) tool designed to analyse group differences when the dependent variable is measured at a nominal level. Like all non-parametric statistics, the Chi-square is robust with respect to the distribution of the data.
Once these tests tell you which features carry information in your dataset, you can plan your next step: what kind of model to build, or whether a classification model is feasible at all.
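
As a rough sketch of how that screening step could look, the snippet below ranks categorical features by their Chi-square p-value against the target. The helper `rank_features_by_chi2`, the column names, and the toy data are all hypothetical; on your data you would pass your own DataFrame and column list.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def rank_features_by_chi2(df, target, features):
    """Run a Chi-square independence test of each categorical feature vs. the target."""
    rows = []
    for col in features:
        # Contingency table of feature values against class labels
        table = pd.crosstab(df[col], df[target])
        chi2, p, dof, _ = chi2_contingency(table)
        rows.append({"feature": col, "chi2": chi2, "p_value": p, "dof": dof})
    # Smallest p-value first: strongest evidence of association with the target
    return pd.DataFrame(rows).sort_values("p_value").reset_index(drop=True)

# Toy data, for illustration only
df = pd.DataFrame({
    "colour": ["red", "red", "blue", "blue", "red", "blue", "red", "blue"] * 10,
    "size":   ["S", "L", "S", "L", "L", "S", "L", "S"] * 10,
    "label":  [1, 1, 0, 0, 1, 0, 1, 0] * 10,
})

ranking = rank_features_by_chi2(df, "label", ["colour", "size"])
print(ranking)
```

Features with large p-values show little evidence of association with the label; if every feature looks like that, it is a hint that a classifier may struggle on this data.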
Regards
DP