Hello everyone!!! I have two practical doubts about the cost function of K-means algorithm.
1)The cost function is computed as the mean of the distances between a cluster centroid and the points assigned to it. However, from what I’ve understood, if we have 3 clusters, we would end with a vector
2)When we run a k-means algorithm for n iterations, say 10 iterations, it ends up with a vector J after each iteration, but it is only the vector J after the 10th iteration that is compared between k-means algorithms with different centroids initializations? Moreover, how do we know which number of iterations is adequate to shift centroids location to the ones that minimize J?
At each iteration, every example is assigned to the closest centroid. There isn’t really a “cost function” for K-means. We just compute the squared distance to each centroid, then assign the example to the nearest centroid. We use the square distance because there’s no advantage in computing the squared root.
I recommend you not persist in considering the K-means cost as a vector. That’s an implementation detail. And it isn’t really a “cost” value.
The number of iterations is determined by experimentation. Or you could inject a rule that you stop iterating once the examples stop being assigned to a different cluster.
So, if I perform 100 k-means runs with different initial centroids assignments and end up with 100 arrays of distances between the centroids and their assigned data points, how do I know which run had the overall best performance?
As you can see in the uploaded picture, the circled run provides a better clustering of the data (which I understood produces a lower cost function). The outcomes of these runs are the array of the distances between the 3 centroids and their data points. However, consider we want to compare the clustering performed by the run “under_J1” and the run “under_J2”, it appears that overall under_J1 has a better performance but if I have to examine the distances for each centroid, it appears that for the green cluster the distance is lower for under_J2. So the question is how do I get a measure of the overall performance of a k-means run to be compared with other runs?