Why not sort a dataset and pick initial centroids at spaced intervals?

I’m at Initializing K-means | Coursera
and the approach to picking the initial centroids is random, which, as noted, can lead you to pick them from the same (eventual) cluster.
Wouldn’t it be an optimization to sort the dataset and then pick the initial points at spaced intervals?

OR am I making a naive observation because I’m looking at a 2D graph in the example, when in reality these are multi-dimensional vectors and there’s no logical way of actually “sorting” the dataset to achieve this effect?

OR (there is a logical way to sort that I don’t know of, but) since you would still need to run the algorithm a large number of times to minimize the cost function, this additional step doesn’t really optimize the performance.


Unique random selections seem to work well.
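As a sketch of what “unique random selections” means in practice (not code from the course, just an illustration assuming the training set is a NumPy array `X`; the helper name `init_centroids` is made up here):

```python
import numpy as np

def init_centroids(X, K, rng=None):
    """Pick K distinct training examples as the initial centroids."""
    rng = np.random.default_rng(rng)
    # replace=False guarantees the K chosen indices are all different,
    # so no two centroids start at exactly the same point
    idx = rng.choice(X.shape[0], size=K, replace=False)
    return X[idx]

# Tiny 2D example
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
centroids = init_centroids(X, K=2, rng=0)
```

Each centroid is an actual training example, and the `replace=False` draw rules out duplicate starting points, though two distinct examples can of course still be close together.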

Hello, @SameeraPerera,

I agree with you that it is difficult to pre-sort them, for the reason you mentioned. In fact, K-means itself is the sorting algorithm, isn’t it? :wink: Seen that way, and given that we wouldn’t know where the clusters are in advance, we would be looking for another clustering algorithm to do the pre-sort before applying K-means. Would that be more effective than just running K-means multiple times?

Also, picking two initial centroids from the same cluster doesn’t have to be a problem, because they may still eventually be drawn to two different clusters.

Initializing two centroids to the same point is a problem, but that is almost impossible with random initialization. Initializing two centroids very close together may be a problem, but if we run K-means multiple times (as you said, which is also a usual practice), then such cases should be rare.
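To make the multiple-runs practice concrete, here is a minimal sketch in NumPy (again not the course code; `run_kmeans` and `best_of_restarts` are names I made up, and this is plain K-means with unique random initialization, keeping the restart with the lowest cost):

```python
import numpy as np

def kmeans_cost(X, centroids, labels):
    # Distortion: mean squared distance from each point to its assigned centroid
    return np.mean(np.sum((X - centroids[labels]) ** 2, axis=1))

def run_kmeans(X, K, iters=10, rng=None):
    # One run of plain K-means starting from K distinct random examples
    rng = np.random.default_rng(rng)
    centroids = X[rng.choice(X.shape[0], size=K, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels

def best_of_restarts(X, K, n_restarts=20, rng=None):
    # Keep the run with the lowest cost, so a restart that happened to start
    # with two nearly coincident centroids is very unlikely to win overall
    rng = np.random.default_rng(rng)
    best_cost, best_result = np.inf, None
    for _ in range(n_restarts):
        centroids, labels = run_kmeans(X, K, rng=rng)
        cost = kmeans_cost(X, centroids, labels)
        if cost < best_cost:
            best_cost, best_result = cost, (centroids, labels)
    return best_cost, best_result

# Two well-separated clusters; the restarts should settle on the obvious split
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
cost, (centroids, labels) = best_of_restarts(X, K=2, rng=0)
```

Even if some individual restarts start badly, taking the minimum-cost result over all of them makes a bad final clustering unlikely.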

Hope this adds some fuel to the thinking process!

Cheers,
Raymond
