Practical questions about K-means cost function

Hello everyone!!! I have two practical questions about the cost function of the K-means algorithm.

1) The cost function is computed as the mean of the distances between a cluster centroid and the points assigned to it. However, from what I've understood, if we have 3 clusters, we end up with a vector

J = [cost_for_cluster_1, cost_for_cluster_2, cost_for_cluster_3]

If we want to compare J between two different initial assignments of the 3 cluster centroids, we simply compute

J_assignment_1 = np.mean(J_1)
J_assignment_2 = np.mean(J_2)

and compare them?

2) When we run the k-means algorithm for n iterations, say 10, we get a vector J after each iteration. Is it only the vector J after the 10th iteration that is compared between k-means runs with different centroid initializations? Moreover, how do we know how many iterations are adequate to shift the centroids to the locations that minimize J?

Thank you very much for your work!!
Cheers,
Tony

  1. At each iteration, every example is assigned to the closest centroid. There isn’t really a “cost function” for K-means. We just compute the squared distance to each centroid, then assign the example to the nearest centroid. We use the squared distance because there’s no advantage in computing the square root.

  2. I recommend you not think of the K-means cost as a vector. That’s an implementation detail, and it isn’t really a “cost” value.

The number of iterations is determined by experimentation. Or you could add a rule to stop iterating once no example is reassigned to a different cluster.
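For example, here is a minimal NumPy sketch (my own illustration with made-up data, not the course's implementation) of the assignment step with squared distances, plus that stopping rule:

```python
import numpy as np

def run_kmeans(X, centroids, max_iters=100):
    prev = None
    for _ in range(max_iters):
        # Squared Euclidean distance from each example to each centroid;
        # no sqrt needed, since argmin gives the same answer either way.
        sq = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = sq.argmin(axis=1)
        if prev is not None and np.array_equal(assign, prev):
            break  # no example changed cluster -> stop iterating
        prev = assign
        # Move each centroid to the mean of its assigned examples.
        for k in range(len(centroids)):
            if np.any(assign == k):
                centroids[k] = X[assign == k].mean(axis=0)
    return centroids, assign

X = np.array([[0.0, 0.0], [0.2, 0.0], [4.0, 4.0], [4.2, 4.0]])
init = np.array([[0.0, 0.0], [4.0, 4.0]])
centroids, assign = run_kmeans(X, init.copy())
print(assign)  # -> [0 0 1 1]
```

With this rule you don't pick the iteration count in advance; `max_iters` is just a safety cap in case the assignments oscillate.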


Hi TMosh! Thank you very much for your reply!

So, if I perform 100 k-means runs with different initial centroid assignments and end up with 100 arrays of distances between the centroids and their assigned data points, how do I know which run had the overall best performance?

As you can see in the uploaded picture, the circled run provides a better clustering of the data (which, as I understand it, corresponds to a lower cost). The outcome of each run is the array of distances between the 3 centroids and their assigned data points. However, if we compare the clustering from run “under_J1” with run “under_J2”, overall under_J1 appears to perform better, but looking at the per-centroid distances, the distance for the green cluster is lower for under_J2. So the question is: how do I get a single measure of the overall performance of a k-means run that can be compared across runs?

Cheers,
Tony

The method you use depends on whether you care about the number of examples in each cluster.

You may (or may not) want to normalize the sum of the squares of the distances within each cluster.
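For instance, a quick sketch (illustrative only; the runs and names here are made up) that collapses each run into one scalar, the within-cluster sum of squared distances, and keeps the run with the lower value:

```python
import numpy as np

def distortion(X, centroids, assignments):
    # Within-cluster sum of squared distances; divide by len(X) if you
    # prefer the mean per example -- the ranking of runs is the same.
    diffs = X - centroids[assignments]
    return (diffs ** 2).sum()

X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 4.0], [5.0, 4.0]])

# Two hypothetical runs: (final centroids, final assignments).
run1 = (np.array([[0.5, 0.0], [4.5, 4.0]]), np.array([0, 0, 1, 1]))
run2 = (np.array([[0.0, 0.0], [3.0, 3.0]]), np.array([0, 0, 1, 1]))

J1 = distortion(X, *run1)
J2 = distortion(X, *run2)
best = run1 if J1 <= J2 else run2
print(J1, J2)  # -> 1.0 8.0
```

Summing the per-cluster values into one number avoids the situation you describe, where one cluster looks better in run A and another looks better in run B.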


Hello Tony @t_09,

You might also google “clustering metrics” for other measures, such as the Silhouette score, to compare their performance.
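For example, a minimal sketch using scikit-learn's `silhouette_score` (assuming scikit-learn is installed; the data and labels here are made up):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two well-separated toy clusters and their labels.
X = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.9]])
labels = np.array([0, 0, 1, 1])

# Silhouette ranges from -1 (poor) to +1 (well separated);
# higher is better, so you would keep the run with the highest score.
score = silhouette_score(X, labels)
print(score > 0.9)  # -> True
```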

Cheers,
Raymond
