Why not sort a dataset and pick initial centroids at spaced intervals?

I’m at Initializing K-means | Coursera
and the approach to picking the initial centroid is random which as noted can lead you to pick them from the same (eventual) cluster.
Wouldn’t it be an optimization to sort the dataset and then pick the initial points by picking them at spaced intervals?

OR am I making a naive observation because I’m looking at a 2D graph in the example and in reality, these are multi-dimensional vectors where there’s no logical way of actually “sorting” the data set to achieve this effect.

OR (there is a logical way to sort, that I don’t know of, but) as you would still need to run the algorithm large number times to minimize the cost function, this additional step doesn’t really optimize the performance.

1 Like

Unique random selections seem to work well.

Hello, @SameeraPerera,

I agree with you that it is difficult to pre-sort them for the reason you mentioned. In fact, K-means itself is the sorting algorithm, isn’t it? :wink: In this way, and given that we wouldn’t know where the clusters are in prior, we would be looking for another clustering algorithm to do the pre-sort before applying K-means. Would that be more effective than just running K-means for multiple times?

Also, picking two initial centroids from the same cluster don’t have to be a problem, because they may still eventually be drawn to two clusters.

Initializing two centroids to the same point is a problem, but that is almost impossible with random initialization. Initializing two centroids very closely may be a problem, but if we run K-means for multiple times (as you said, which is also an usual practice), then such case should be minor.

Hope this add some fuel to the thinking process!


1 Like