What will be a good machine learning algorithm for this distribution

HOANG_GIA_MINH1 · May 26, 2023, 1:21am

Dear all,

I have 10 features for 3 classes, and all of them have a distribution similar to this (green, blue and orange are for classes 0, 1 and 2):

I tried GaussianNB and LogisticRegression, but it looks like these algorithms are not good enough for this data. Do you have any suggestions for other algorithms?

Thank you all

saifkhanengr · May 26, 2023, 6:21am

Tell us more about your data. And what do you want to achieve by using AI for this problem?

HOANG_GIA_MINH1 · May 26, 2023, 6:59am

The data is numerical. I want to solve a classification problem using that data. Here is the result when I applied GaussianNB to my data. However, the accuracy is only around 69%.

saifkhanengr · May 26, 2023, 7:08am

I have no experience with GaussianNB, but you can use a neural network with a softmax activation function in the last layer.

Start with a simple model, just one hidden layer with a few neurons. Then tune the hyperparameters according to the results you get.
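A minimal sketch of that starting point, assuming scikit-learn is available. The data here is synthetic (`make_classification` standing in for the 10 features and 3 classes described above), and the hidden layer size of 16 is an arbitrary first guess, not a recommendation. For multiclass targets, `MLPClassifier` applies softmax at the output layer.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): 10 numeric features, 3 classes.
X, y = make_classification(
    n_samples=1500, n_features=10, n_informative=6, n_redundant=2,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# One small hidden layer; scaling the inputs first usually helps the optimizer.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
probabilities = model.predict_proba(X_test)  # softmax outputs, one column per class
print(f"test accuracy: {test_accuracy:.3f}")
```

From here, `hidden_layer_sizes`, `alpha` (L2 penalty) and `learning_rate_init` are the usual hyperparameters to tune, ideally with cross-validation rather than the single split used above.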

jerryola1 · May 27, 2023, 9:57am

Which metrics do you focus on, depending on the project: accuracy, precision, or recall?

saifkhanengr · May 27, 2023, 10:06am

It’s a personal choice. I mostly use accuracy and a few times I use precision.

jerryola1 · May 27, 2023, 12:58pm

Yeah very correct about that

HOANG_GIA_MINH1 · May 27, 2023, 1:11pm

I focus on accuracy and sensitivity mostly
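For a three-class problem, sensitivity (recall) is computed per class, and it can differ a lot from overall accuracy. The sketch below uses plain Python and made-up labels purely for illustration; `accuracy_and_sensitivity` is a hypothetical helper name, not a library function.

```python
from collections import Counter

def accuracy_and_sensitivity(y_true, y_pred):
    """Return overall accuracy and per-class sensitivity (recall)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    support = Counter(y_true)                                # samples per true class
    true_positives = Counter(t for t, p in pairs if t == p)  # correct hits per class
    sensitivity = {c: true_positives[c] / n for c, n in support.items()}
    return correct / len(y_true), sensitivity

# Illustrative labels only (not from the thread's dataset).
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 0, 0]

acc, sens = accuracy_and_sensitivity(y_true, y_pred)
print(acc)   # 0.6
print(sens)  # class 0: 0.75, class 1: about 0.667, class 2: about 0.333
```

Here the overall accuracy of 0.6 hides the fact that class 2 is recognised only one time in three, which is why looking at both metrics is useful.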

Christian_Simonis · May 27, 2023, 4:03pm

I agree with @saifkhanengr. It would be useful to understand:

what your feature dependencies look like
how much data you have
how much domain knowledge is available (which you can potentially encode in your features)

So what would be helpful is if you could provide e.g. a scatterhist plot for all your features and labels.

Something like this (only that you would have three labels instead of one):

Looking not only at the distribution of the label classes but also at the distribution and dependencies of your features will help you understand the data better. That is an essential step in CRISP-DM, and it helps you prepare better data to succeed in modelling.
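One way to get such an overview, sketched with `pandas.plotting.scatter_matrix` (histograms on the diagonal, pairwise scatter plots elsewhere). The data frame below is synthetic and the column names are placeholders; the colours follow the green/blue/orange convention from the first post.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
n_per_class = 200
labels = np.repeat([0, 1, 2], n_per_class)

# Synthetic, overlapping features (assumption): class means are only slightly shifted.
df = pd.DataFrame({
    "feature_a": rng.normal(loc=labels * 0.8, scale=1.0),
    "feature_b": rng.normal(loc=labels * 0.5, scale=1.0),
    "feature_c": rng.normal(loc=0.0, scale=1.0, size=labels.size),
})

# Encode the class in the point colour; alpha makes dense, overlapping regions readable.
palette = np.array(["green", "blue", "orange"])
axes = scatter_matrix(
    df, c=palette[labels], alpha=0.4, diagonal="hist", figsize=(8, 8)
)
```

In a script, `matplotlib.pyplot.show()` displays the figure; in a notebook it renders automatically.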

After analysing your data, you can probably judge better:

data-processing-wise: how to improve your features and how important they are, see also CRISP-DM:

modelling-wise: whether e.g. a Gaussian mixture model might be helpful, and which (selected or transformed) features you want to feed it, see: 2.1. Gaussian mixture models — scikit-learn 1.3.2 documentation.
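As a rough sketch of how a Gaussian mixture model can be used for classification: fit one `GaussianMixture` per class and assign each sample to the class with the highest log-likelihood plus log prior. The data is synthetic, and `n_components=2` is an assumed starting value that would need tuning on real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 10 numeric features, 3 classes.
X, y = make_classification(
    n_samples=1500, n_features=10, n_informative=6, n_redundant=2,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

classes = np.unique(y_train)

# One mixture per class, fitted only on that class's training samples.
mixtures = {
    c: GaussianMixture(n_components=2, covariance_type="full", random_state=0)
    .fit(X_train[y_train == c])
    for c in classes
}
log_priors = np.log([np.mean(y_train == c) for c in classes])

# log p(x | class) + log p(class), one column per class.
log_joint = np.column_stack(
    [mixtures[c].score_samples(X_test) for c in classes]
) + log_priors
y_pred = classes[np.argmax(log_joint, axis=1)]

gmm_accuracy = np.mean(y_pred == y_test)
print(f"test accuracy: {gmm_accuracy:.3f}")
```

With `n_components=1` and `covariance_type="diag"` this reduces to the same model family as GaussianNB, so increasing the number of components or switching to full covariances is a natural next step when GaussianNB underfits.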

Please let me know if this helps!

Best regards
Christian

Christian_Simonis · May 27, 2023, 4:07pm

Also this specific example with three classes might be worth a look, @HOANG_GIA_MINH1:

and also: GMM classification — scikit-learn 0.15-git documentation

Best regards
Christian

HOANG_GIA_MINH1 · May 28, 2023, 10:26am

Here is my scatter matrix. I agree with you that the classes overlap in the scatter plots. However, it is really interesting that even though the scatter matrix looks overlapped, the statistical test still showed a significant difference.
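Both observations can hold at once: a significance test asks whether the group means differ, which becomes easy to detect with many samples, while classification depends on how much the individual samples overlap. The sketch below illustrates this with two synthetic classes and a two-sample t-test from SciPy; the thread does not say which test was used, so the test choice, sample sizes and means are all assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Two heavily overlapping classes: means differ by 0.2 standard deviations.
class_0 = rng.normal(loc=0.0, scale=1.0, size=5000)
class_1 = rng.normal(loc=0.2, scale=1.0, size=5000)

# The difference in means is highly significant because the sample is large.
t_stat, p_value = ttest_ind(class_0, class_1)

# Yet the best single threshold (midpoint of the true means) barely beats chance.
threshold = 0.1
accuracy = 0.5 * (np.mean(class_0 <= threshold) + np.mean(class_1 > threshold))

print(f"p-value: {p_value:.2e}")
print(f"threshold accuracy: {accuracy:.3f}")
```

A significant test result therefore says the feature carries some signal, not that the feature alone separates the classes well.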

Christian_Simonis · May 28, 2023, 11:15am

Thanks for the update, @HOANG_GIA_MINH1!

What do the colors red and blue mean in your example? (If they stand for train/test, it would not be a random shuffle from what I see.)

Some hints:

It can also be helpful to encode the class information in the color in that plot, so that you can visualise underlying patterns.
You could control the transparency of your scatter plot with the alpha parameter when plotting your dataframe with scatter_matrix.
There seems to be highly redundant information in your features ADAS 13 and ADAS 11. In general, you could analyze and remove redundancy with PCA or PLS approaches.
It could be interesting to calculate feature importance, e.g. with a SHAP plot.
The Gaussian mixture models linked above might be worth a try once you have figured out your most important features.
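The redundancy hint can be checked numerically before deciding what to drop. The sketch below builds two nearly identical synthetic columns (standing in for a pair like ADAS 13 and ADAS 11; the real values are not available here), inspects their correlation, and lets PCA keep only as many components as are needed for 95% of the variance. The 95% threshold is an assumed, commonly used default.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic features (assumption): score_b is almost a copy of score_a.
score_a = rng.normal(size=n)
score_b = score_a + rng.normal(scale=0.1, size=n)
other = rng.normal(size=n)
X = np.column_stack([score_a, score_b, other])

# A correlation close to 1 flags the redundant pair.
corr = np.corrcoef(X, rowvar=False)
print(f"corr(score_a, score_b) = {corr[0, 1]:.3f}")

# Standardise first so PCA is not dominated by feature scale, then keep
# the smallest number of components explaining at least 95% of the variance.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95).fit(X_scaled)
X_reduced = pca.transform(X_scaled)

print(pca.n_components_, pca.explained_variance_ratio_)
```

Three input columns collapse to two components here because the redundant pair contributes essentially one direction of variance.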

Hope that helps!

Best regards
Christian
