hi,
Ng explains methods for choosing the right K value, and also the importance of running k-means with multiple random initializations to get the optimum clustering (one that gives pure and seemingly logical clusters).
But how do we do both, and which one should I do first? Should I first choose the number of clusters using domain knowledge and then move on to the multiple initialization process?
It becomes confusing if I want to use the elbow method to find the optimum K value.
Hi @mehmet_baki_deniz, that’s a good question!
I would suggest first narrowing down a range of plausible K values, using domain knowledge, the elbow method, or any other method. Then, for each K in that range, run k-means with multiple random initializations and keep the run with the lowest cost, so that every K is represented by its best clustering. Finally, choose among the candidate K values by comparing the quality of those clusterings with a suitable metric.
One thing to note: the lowest-cost clustering is not guaranteed to give pure clusters, and whether you consider a result optimum also depends on how you interpret it using domain knowledge.
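To make the order of operations concrete, here is a minimal pure-Python sketch of "for each candidate K, try several random initializations and keep the lowest-cost run". The toy data, the helper names (`kmeans_once`, `best_of_n`) and the value `n_init=30` are my own assumptions for illustration, not something from the course.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_once(points, k, rng, n_iter=50):
    """One k-means run from a single random initialization.

    Returns (cost, centroids), where cost is the sum of squared
    distances from each point to its nearest centroid.
    """
    centroids = rng.sample(points, k)  # pick k distinct data points as starting centroids
    for _ in range(n_iter):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    cost = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return cost, centroids

def best_of_n(points, k, n_init=20, seed=0):
    """Run k-means n_init times for a fixed K and keep the lowest-cost run."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(n_init)),
               key=lambda run: run[0])

# Toy data (assumed): three well-separated 2-D blobs of 30 points each.
data_rng = random.Random(42)
centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
points = [(cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5))
          for cx, cy in centers for _ in range(30)]

# Outer loop over candidate K values, inner loop over initializations.
costs = {k: best_of_n(points, k, n_init=30)[0] for k in range(1, 7)}
for k, cost in costs.items():
    print(k, round(cost, 2))
```

Plotting `costs` against K gives the elbow curve. Since each point on it is already the best of several initializations, an unlucky start is less likely to distort the curve.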
Best Regards,
Mujassim
thank you Mujassim
I checked the scikit-learn documentation. The scikit-learn model has a parameter called `n_init`.
It seems the model runs k-means `n_init` times and keeps the best result for the given K. So we can just set it to 100, as Ng advises, and then decide which K value to use with an appropriate method.
Would you (guys) agree with this interpretation?
You can set the `n_init` parameter higher if you want to increase the chances of finding a better solution, and try a range of `n_clusters` values. You can then use an appropriate method for deciding the K value. Also, scikit-learn's KMeans handles centroid initialization automatically for each K value and keeps the best of the `n_init` runs (the one with the lowest inertia), so you only have to take care of the number of clusters.
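Here is a rough sketch of that workflow with scikit-learn. The synthetic data from `make_blobs`, the K range 2 to 8, `n_init=20` and the use of the silhouette score as the comparison metric are all assumptions on my side, so swap in your own data and whichever metric suits your problem.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (assumed for illustration): 300 points around 4 centers.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

results = {}
for k in range(2, 9):
    # n_init=20: k-means is run 20 times with different centroid seeds and
    # scikit-learn keeps the run with the lowest inertia for this K.
    model = KMeans(n_clusters=k, n_init=20, random_state=0).fit(X)
    results[k] = (model.inertia_, silhouette_score(X, model.labels_))

for k, (inertia, sil) in results.items():
    print(f"K={k}  inertia={inertia:.1f}  silhouette={sil:.3f}")

# One possible selection rule (an assumption): take the K with the highest silhouette.
best_k = max(results, key=lambda k: results[k][1])
print("K with highest silhouette:", best_k)
```

The inertia values can be used for an elbow plot, while the silhouette score gives a second opinion. Neither replaces checking whether the clusters make sense for your domain.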
Please feel free to share the results.
Best Regards,
Mujassim