What would be a good machine learning algorithm for this distribution?

Dear all,

I have 10 features for 3 classes, and all of them have a distribution similar to this (green, blue, and orange are for classes 0, 1, and 2):


I tried GaussianNB and LogisticRegression; however, it looks like these algorithms are not good enough for this data. Do you have any other algorithms to suggest?

Thank you all

Tell us more about your data. What do you want to achieve by using AI for this problem?

The data is numerical. I want to solve a classification problem using it. Here is the result when I applied GaussianNB to my data; however, the accuracy is only around 69%.
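For reference, a minimal sketch of the GaussianNB baseline described above. The actual dataset is not shown in the thread, so synthetic stand-in data (10 numerical features, 3 classes) is generated here:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical stand-in data: 10 numerical features, 3 classes.
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=6, n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

clf = GaussianNB().fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f"accuracy: {accuracy_score(y_test, y_pred):.2f}")
# Per-class precision/recall is more informative than accuracy alone:
print(classification_report(y_test, y_pred))
```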

I have no experience with GaussianNB, but you can use a neural network with a softmax activation function in the last layer.

Start with a simple model, just one hidden layer with a few neurons. Then tune the hyperparameters according to the results you get.
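A minimal sketch of this suggestion using scikit-learn's `MLPClassifier`, which applies a softmax output for multiclass problems automatically. The data here is a synthetic stand-in, and the hidden-layer size is just a starting point to tune:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

# Hypothetical stand-in data: 10 features, 3 classes.
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=6, n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# One small hidden layer to start with; inputs are scaled first,
# since neural networks are sensitive to feature scale.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

From there, tune the number of neurons, layers, and regularisation (`alpha`) based on validation results.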


Which metrics do you focus on, depending on the project? Accuracy, precision, or recall?

It’s a personal choice. I mostly use accuracy, and occasionally precision.

Yeah :+1: very correct about that

I focus on accuracy and sensitivity mostly

I agree with @saifkhanengr. It would be useful to understand:

  • what your feature dependencies look like
  • how much data you have
  • how much domain knowledge is available (which you can potentially encode in your features)

So it would be helpful if you could provide, e.g., a scatterhist plot for all your features and labels.

Something like this (except that you would have three labels instead of one):
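As a sketch, such a plot can be produced with pandas' `scatter_matrix`, colouring each point by its class label. The data and column names here are invented for illustration:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Hypothetical stand-in data: 4 features, 3 classes.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(300, 4)),
                  columns=["f0", "f1", "f2", "f3"])
labels = rng.integers(0, 3, size=len(df))

# One colour per class, matching the green/blue/orange convention above.
colors = np.array(["green", "blue", "orange"])[labels]
axes = scatter_matrix(df, c=colors, alpha=0.5, diagonal="hist",
                      figsize=(8, 8))
plt.savefig("scatter_matrix.png")
```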

Looking not only at the distribution of the label classes but also at the distributions and dependencies of your features will help you understand the data better. This is an essential step in CRISP-DM and helps you prepare better data to succeed in modelling.

So after analysing your data, you can probably judge this better.

Please let me know if this helps!

Best regards


Also this specific example with three classes might be worth a look, @HOANG_GIA_MINH1:

and also: GMM classification - scikit-learn 0.15-git documentation
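The idea behind that linked example, sketched on synthetic stand-in data: fit one Gaussian mixture per class on the training data, then assign each test point to the class whose mixture gives it the highest log-likelihood:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 features, 3 classes.
X, y = make_classification(n_samples=900, n_features=10,
                           n_informative=6, n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# One mixture model per class, fitted only on that class's samples.
classes = np.unique(y_train)
gmms = {c: GaussianMixture(n_components=2, random_state=0)
           .fit(X_train[y_train == c]) for c in classes}

# score_samples gives the per-point log-likelihood under each class model.
log_likelihood = np.column_stack(
    [gmms[c].score_samples(X_test) for c in classes])
y_pred = classes[np.argmax(log_likelihood, axis=1)]
print(f"accuracy: {(y_pred == y_test).mean():.2f}")
```

The number of mixture components per class (`n_components=2` here) is an assumption to tune, e.g. by cross-validation or BIC.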

Best regards


Here is my scatter matrix. I agree with you that the classes overlap in the scatter plots. However, it is really interesting that even though the scatter matrix looks overlapped, the statistical test still showed a significant difference.

Thanks for the update, @HOANG_GIA_MINH1!

What do the colors red/blue mean in your example? (If they are train/test, it does not look like a random shuffle from what I see.)

Some hints:

  • I guess it can also be helpful to encode the class information in the colour of that plot, so that you can visualise underlying patterns
  • you could control the transparency of your scatter plot with the alpha parameter when plotting your DataFrame with scatter_matrix
  • there seems to be highly redundant information in your features ADAS 13 and ADAS 11; in general, you could analyse and remove such redundancy with PCA or PLS approaches
  • it could be interesting to calculate feature importances, e.g. with SHAP plots
  • the Gaussian mixture models linked above might be worth a try once you have figured out your most important features
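A small sketch of the redundancy check mentioned above, using a correlation matrix and PCA explained variance. The data and column names are invented; `f1` is deliberately constructed as a near-duplicate of `f0` to mimic a redundant pair like ADAS 11 / ADAS 13:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
df = pd.DataFrame({
    "f0": base[:, 0],
    # Near-duplicate of f0, standing in for a redundant feature pair:
    "f1": base[:, 0] * 1.1 + rng.normal(scale=0.05, size=200),
    "f2": base[:, 1],
    "f3": base[:, 2],
})

# Pairwise correlations: the redundant pair shows correlation near 1.
print(df.corr().round(2))

# PCA on standardised features: with one redundant feature,
# the last component carries almost no variance.
pca = PCA().fit(StandardScaler().fit_transform(df))
print(pca.explained_variance_ratio_.round(3))
```

If the last few components explain almost nothing, you can drop one of the redundant features or keep only the leading components.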

Hope that helps!

Best regards