Thank you for confirming that K-means does not "always" converge the sum of squared distances to zero. However, when K = 1 in the given example, with p1 = (1.0, 1.0), p2 = (2.0, 2.0), and µ1 = final cluster centroid = (1.5, 1.5), if we calculate the "sum of cosine distances" instead of the "sum of squared distances", it converges to zero.
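This claim can be checked numerically. Here is a small sketch, assuming "cosine distance" means 1 minus the cosine similarity:

```python
import numpy as np

def cos_dist(a, b):
    """Cosine distance: 1 minus the cosine similarity of a and b."""
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

p1, p2 = np.array([1.0, 1.0]), np.array([2.0, 2.0])
mu = (p1 + p2) / 2  # with K = 1, the centroid is the mean: (1.5, 1.5)

total = cos_dist(p1, mu) + cos_dist(p2, mu)
print(total)  # ~0 up to floating-point error
```

The sum is zero here only because p1, p2, and the centroid all lie on the same ray from the origin, so the angle between each point and the centroid is zero.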
Yes, that’s an interesting idea. It seems to me you are good at looking into different possibilities! However, I still have to explain why that possibility is not a way out when we consider the statement "k-means converges to zero".
I can also immediately propose another two-datapoint case in which neither the squared distance nor your cosine distance converges to zero when k = 1. For example:
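One such pair, as a sketch (these particular coordinates are my own illustration, not from the original post):

```python
import numpy as np

def cos_dist(a, b):
    """Cosine distance: 1 minus cosine similarity."""
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two points that are NOT collinear with their mean through the origin.
p1, p2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mu = (p1 + p2) / 2  # k = 1 centroid: (0.5, 0.5)

sq = ((p1 - mu) ** 2).sum() + ((p2 - mu) ** 2).sum()
cd = cos_dist(p1, mu) + cos_dist(p2, mu)
print(sq, cd)  # 1.0 and about 0.586: both strictly positive
```

Here each point sits at a 45-degree angle to the centroid, so the cosine distance cannot reach zero either; no placement of the single centroid drives both measures to zero.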
The k-means algorithm does not specify which distance to use, but one thing is certain: for me to say "k-means converges to zero", it has to be achieved by k-means alone. It can’t be achieved by us changing the value of k, and it can’t be achieved by us changing how we measure distance. In other words, since k-means only adjusts the locations of the centroids, we can say k-means converges to zero only when, with k, the distance measurement, and everything else fixed, k-means can always reduce the distance to zero by adjusting those locations.
Cheers,
Raymond
PS: Since k-means clusters datapoints by distance, if you changed to cosine distance, you would also change the behavior of the result. Is that behavior expected? I think making the cost zero isn’t everything about k-means, right? You need a useful result as well, right?
In my analysis of the above example with K = 1, I used different distance metrics such as Euclidean and cosine. In both cases I observed that K-means guarantees the cost to decrease or remain the same at each iteration. However, the cost function J does not always converge to zero. The K-means algorithm’s rules remain the same, but the final cluster identities and centroids may change, so I think there may be a change in the optimization objective of the K-means cost function. Is this acceptable?
As the K-means algorithm does not specify which distance metric to use, I analyzed distance metrics for clustering in more depth. I reviewed the journal "Engineering Applications of Artificial Intelligence", in particular Section 6.3.1, "Clustering similarity measures", which outlines common distance functions such as Euclidean, cosine, Minkowski, Mahalanobis, Pearson correlation, and more. Here are the citation details: Ezugwu et al., "A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects", Engineering Applications of Artificial Intelligence, 2022 (article 104743).
Please see the following example. When K = 3, using the example input X = np.array([[0,1], [1,2], [1,3], [8,4], [9,3], [8,2], [8,8], [8,6], [5,8], [7,7], [4,7], [3,8], [2,9], [6,9]]), we can see that the K-means algorithm can be affected by different random initializations. This is evident when we use the line np.random.seed(200) in the “kMeans_init_centroids” function.
This example demonstrates a lack of convexity: the K-means algorithm can become trapped in suboptimal solutions, or local optima, as seen with random seed 200.
For example:
- When random seed = 9, the lowest squared distance error was 44.1667.
- When random seed = 42, the lowest squared distance error was 45.4667.
- When random seed = 123, the lowest squared distance error was 44.1667.
- When random seed = 200, the lowest squared distance error was 117.75; the K-means algorithm got stuck in a local optimum.
However, when we use the k-means++ algorithm to choose the initial cluster centroids for K-means clustering, it avoids getting trapped in suboptimal solutions, unlike random seed 200, which displays a sudden spike in the error.
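The seed sensitivity described above can be reproduced with a minimal from-scratch sketch. Note that my `lloyd`, `random_init`, and `kmeanspp_init` helpers below are illustrative stand-ins for the course's `kMeans_init_centroids` and related code, and they use a different random generator, so the exact error values will not match the ones quoted above.

```python
import numpy as np

X = np.array([[0,1],[1,2],[1,3],[8,4],[9,3],[8,2],[8,8],
              [8,6],[5,8],[7,7],[4,7],[3,8],[2,9],[6,9]], dtype=float)

def lloyd(X, centroids, iters=30):
    """Run Lloyd iterations from given initial centroids; return final cost."""
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        idx = d.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        for j in range(len(centroids)):
            if np.any(idx == j):
                centroids[j] = X[idx == j].mean(axis=0)
    return ((X - centroids[idx]) ** 2).sum()

def random_init(X, k, rng):
    """Pick k distinct datapoints at random as initial centroids."""
    return X[rng.choice(len(X), size=k, replace=False)].copy()

def kmeanspp_init(X, k, rng):
    """k-means++ seeding: next centroid sampled with prob. proportional
    to its squared distance from the nearest centroid chosen so far."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2
              ).sum(axis=2).min(axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

for seed in (9, 42, 123, 200):
    cost = lloyd(X, random_init(X, 3, np.random.default_rng(seed)))
    print("random init, seed", seed, "->", round(cost, 4))

print("k-means++ ->", round(lloyd(X, kmeanspp_init(X, 3, np.random.default_rng(0))), 4))
```

The design point is that k-means++ does not change the Lloyd iterations at all; it only spreads the initial centroids apart, which makes the worst local optima much less likely.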
Can we determine whether, for the series shown in the example, the function f(squared distance, random seed) exhibits discontinuities or sharp spikes at specific random seeds? I think at these points, with values such as 102.99, 102.99, 114.86, 117.75, 115.36, 114.86, and so on, this function cannot be differentiated.
This is part of the cost function J from the “Optimization objective” class by Prof. Andrew Ng, but I have used only the sum of squared distances to calculate the training error.
Hello @rmwkwok
The main purpose of this code and plot is to determine the seed value that provides the lowest training error, or sum of squared errors (y-axis). We can notice discontinuities or sharp spikes at specific seed values (x-axis). These patterns make it challenging to differentiate the function effectively.
When we plot, we only ever plot a finite number of data points. When the set is finite, there is always a gap between two data points, but should these gaps be seen as proof of discontinuity? Depending on how dense the data points are, we might see spikes even for continuous functions that have at least a minimum or maximum point, so is that really a good way to conclude anything about continuity? I would not call a function discontinuous by looking at a plot of a finite number of points. I would call any function of the random seed discontinuous, and my reason is that the random seed is, by nature, a discrete domain.
Thank you! I agree with your response! However, regarding the sum of squared errors function, can we approximate it as a general series as below, where we split this series into K distinct terms? If K is infinite, the series is considered an infinite series; if K is finite, it is referred to as a finite series, right?
Instead of an “approximation”, I think it is the exact form. It is a sum of k terms. As for the definition of a series, since I am not a mathematician, it would be better for you to consult someone more knowledgeable in mathematics.
Thank you for providing the equations for the sum of k terms.
However, most of these important “series”-related topics, like Taylor series, Fourier series, discontinuities, and more, are fundamental in engineering mathematics, so there should be no problem in understanding the mathematical equations. But I am not familiar with a proof for the cost function J, i.e. the sum of k terms of the squared distance error function, so that’s why I used the term “approximation”.
In the following example, let’s consider the dataset X = [[0,1], [1,2], [1,3], [8,4], [9,3], [8,2], [8,8], [8,6], [5,8], [7,7], [4,7], [3,8], [2,9], [6,9]]. When K = 3 and the centroids are initialized using a random seed of 67, we get the initial centroids [[1,3], [7,7], [0,1]]. The final cluster assignments were idx = [2 0 0 1 1 1 1 1 1 1 1 1 1 1], and the three final centroids were [1, 2.5] (cluster 0), [6.18181818, 6.45454545] (cluster 1), and [0, 1] (cluster 2). (My printed output listed, for each datapoint, the centroid it was assigned to, which is why the centroid values appeared repeated.)
I calculated the squared distance error: 114.86363636363636. Most importantly, I want to understand why there was such a high training error. Could it be that the K-means algorithm is trapped in a suboptimal solution due to this seed (67) and the initial centroids [[1,3], [7,7], [0,1]]?
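As a quick check, the reported error can be recomputed directly from the final assignment; here is a short numpy sketch. The assignment itself also suggests why the cost is high: two of the three centroids ended up sharing the small bottom-left group (points [0,1], [1,2], [1,3]), leaving a single centroid to cover both large groups, which is consistent with being trapped in a local optimum.

```python
import numpy as np

X = np.array([[0,1],[1,2],[1,3],[8,4],[9,3],[8,2],[8,8],
              [8,6],[5,8],[7,7],[4,7],[3,8],[2,9],[6,9]], dtype=float)
idx = np.array([2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

# Recompute each final centroid as the mean of its cluster's points.
centroids = np.array([X[idx == j].mean(axis=0) for j in range(3)])
cost = ((X - centroids[idx]) ** 2).sum()
print(cost)  # ~114.8636, matching the reported error
```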
The sum of squared distances is the sum of the squared distances between each point and that point’s centroid. What proof do you want? You don’t need a proof for the formula of the sum of squared distances, because that is its definition.
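In symbols, writing C_j for the set of points assigned to centroid mu_j, the definition is exactly a finite sum of K terms:

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```

Each of the K terms is the within-cluster sum of squared distances for one cluster, so no approximation is involved.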
Besides, when we talk about an approximation in a maths context, we know it is not the exact form. It’s fine to be unfamiliar with something, but that doesn’t make it an approximation in the maths sense. Let’s use this special term in the same way as others would when discussing maths.