Thank you for confirming that K-means does not "always" converge the sum of squared distances to zero. However, when K = 1 in the given example, with p1 = (1.0, 1.0), p2 = (2.0, 2.0), and µ1 = final cluster centroid = (1.5, 1.5), if we calculate the "sum of cosine distances" instead of the "sum of squared distances", it converges to zero.
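This claim can be checked numerically. Here is a small sketch, assuming "cosine distance" means 1 minus the cosine similarity:

```python
import numpy as np

def cos_dist(a, b):
    """Cosine distance: 1 minus the cosine similarity of a and b."""
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

p1, p2 = np.array([1.0, 1.0]), np.array([2.0, 2.0])
mu = (p1 + p2) / 2  # with K = 1, the centroid is the mean: (1.5, 1.5)

total = cos_dist(p1, mu) + cos_dist(p2, mu)
print(total)  # ~0 up to floating-point error
```

The sum is zero here only because p1, p2, and the centroid all lie on the same ray from the origin, so the angle between each point and the centroid is zero.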
Yes, that’s an interesting idea. It seems to me you are good at looking into different possibilities! However, I still have to explain why that possibility is not a way out when we consider the statement "k-means converges to zero".
I can also immediately propose another two-datapoint case in which neither the squared distance nor your cosine distance converges to zero when k = 1. For example:
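One such pair, as a sketch (these particular coordinates are my own illustration, not from the original post):

```python
import numpy as np

def cos_dist(a, b):
    """Cosine distance: 1 minus cosine similarity."""
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two points that are NOT collinear with their mean through the origin.
p1, p2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mu = (p1 + p2) / 2  # k = 1 centroid: (0.5, 0.5)

sq = ((p1 - mu) ** 2).sum() + ((p2 - mu) ** 2).sum()
cd = cos_dist(p1, mu) + cos_dist(p2, mu)
print(sq, cd)  # 1.0 and about 0.586: both strictly positive
```

Here each point sits at a 45-degree angle to the centroid, so the cosine distance cannot reach zero either; no placement of the single centroid drives both measures to zero.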
The k-means algorithm does not specify which distance to use, but one thing is certain: for me to say "k-means converges to zero", it has to be achieved by k-means alone. It can’t be achieved by us changing the value of k, and it can’t be achieved by us changing how we measure distance. In other words, since k-means only adjusts the locations of the centroids, we can say k-means converges to zero only when, with k, the distance measurement, and everything else fixed, k-means can always reduce the distance to zero by adjusting those locations.
Cheers,
Raymond
PS: Since k-means clusters datapoints by distance, if you changed to cosine distance, you would also change the behavior of the result. Is that behavior expected? I think making the cost zero isn’t everything about k-means, right? You need a useful result as well, right?
In my analysis of the above example with K = 1, I used different distance metrics such as Euclidean and cosine. In both cases I observed that K-means guarantees the cost to decrease or remain the same at each iteration. However, the cost function J does not always converge to zero. The K-means algorithm’s rules remain the same, but the final cluster identities and centroids may change, so I think there may be a change in the optimization objective of the K-means cost function. Is this acceptable?
As the K-means algorithm does not specify which distance metric to use, I analyzed distance metrics for clustering in more depth. I reviewed the journal "Engineering Applications of Artificial Intelligence", in particular Section 6.3.1, "Clustering similarity measures", which outlines common distance functions such as Euclidean, cosine, Minkowski, Mahalanobis, Pearson correlation, and more. Here are the citation details: Ezugwu et al., "A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects", Engineering Applications of Artificial Intelligence, 2022 (article 104743).
Please see the following example. When K = 3, using the example input X = np.array([[0,1], [1,2], [1,3], [8,4], [9,3], [8,2], [8,8], [8,6], [5,8], [7,7], [4,7], [3,8], [2,9], [6,9]]), we can see that the K-means algorithm can be affected by different random initializations. This is evident when we use the line np.random.seed(200) in the “kMeans_init_centroids” function.
This example demonstrates a lack of convexity: the K-means algorithm can become trapped in suboptimal solutions, or local optima, as seen with random seed 200.
For example:
- When random seed = 9, the lowest squared distance error was 44.1667.
- When random seed = 42, the lowest squared distance error was 45.4667.
- When random seed = 123, the lowest squared distance error was 44.1667.
- When random seed = 200, the lowest squared distance error was 117.75; the K-means algorithm got stuck in a local optimum.
However, when we use the k-means++ algorithm to choose the initial cluster centroids for K-means clustering, it avoids getting trapped in suboptimal solutions, unlike random seed 200, which displays a sudden spike in the error.
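The seed sensitivity described above can be reproduced with a minimal from-scratch sketch. Note that my `lloyd`, `random_init`, and `kmeanspp_init` helpers below are illustrative stand-ins for the course's `kMeans_init_centroids` and related code, and they use a different random generator, so the exact error values will not match the ones quoted above.

```python
import numpy as np

X = np.array([[0,1],[1,2],[1,3],[8,4],[9,3],[8,2],[8,8],
              [8,6],[5,8],[7,7],[4,7],[3,8],[2,9],[6,9]], dtype=float)

def lloyd(X, centroids, iters=30):
    """Run Lloyd iterations from given initial centroids; return final cost."""
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        idx = d.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        for j in range(len(centroids)):
            if np.any(idx == j):
                centroids[j] = X[idx == j].mean(axis=0)
    return ((X - centroids[idx]) ** 2).sum()

def random_init(X, k, rng):
    """Pick k distinct datapoints at random as initial centroids."""
    return X[rng.choice(len(X), size=k, replace=False)].copy()

def kmeanspp_init(X, k, rng):
    """k-means++ seeding: next centroid sampled with prob. proportional
    to its squared distance from the nearest centroid chosen so far."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2
              ).sum(axis=2).min(axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

for seed in (9, 42, 123, 200):
    cost = lloyd(X, random_init(X, 3, np.random.default_rng(seed)))
    print("random init, seed", seed, "->", round(cost, 4))

print("k-means++ ->", round(lloyd(X, kmeanspp_init(X, 3, np.random.default_rng(0))), 4))
```

The design point is that k-means++ does not change the Lloyd iterations at all; it only spreads the initial centroids apart, which makes the worst local optima much less likely.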
Can we determine whether, for the series shown in the example, the function f(squared distance, random seed) exhibits discontinuities or sharp spikes at specific random seeds? I think at these points, with values such as 102.99, 102.99, 114.86, 117.75, 115.36, 114.86, and so on, this function cannot be differentiated.
This is part of the cost function J from the “Optimization objective” class by Prof. Andrew Ng, but I have used only the sum of squared distances to calculate the training error.
Hello @rmwkwok
The main purpose of this code and plot is to determine the seed value that provides the lowest training error, or sum of squared errors (y-axis). We can notice discontinuities or sharp spikes at specific seed values (x-axis). These patterns make it challenging to differentiate the function effectively.
When we plot, we only ever plot a finite number of data points. When the set is finite, there is always a gap between two data points, but should these gaps be seen as proof of discontinuity? Depending on how dense the data points are, we might see spikes even for continuous functions that have at least a minimum or maximum point, so is that really a good way to conclude anything about continuity? I would not call a function discontinuous by looking at a plot of a finite number of points. I would call any function of the random seed discontinuous, and my reason is that the random seed is, by nature, a discrete domain.
Thank you! I agree with your response! However, regarding the sum of squared errors function, can we approximate it as a general series as below, where we split this series into K distinct terms? If K is infinite, the series is considered an infinite series; if K is finite, it is referred to as a finite series, right?
Instead of an “approximation”, I think it is the exact form. It is a sum of k terms. As for the definition of a series, since I am not a mathematician, it would be better for you to consult someone more knowledgeable in mathematics.
Thank you for providing the equations for the sum of k terms.
However, most of these important “series”-related topics, like Taylor series, Fourier series, discontinuities, and more, are fundamental in engineering mathematics, so there should be no problem in understanding the mathematical equations. But I am not familiar with a proof for the cost function J, i.e. the sum of k terms of the squared distance error function, so that’s why I used the term “approximation”.
In the following example, let’s consider the dataset X = [[0,1], [1,2], [1,3], [8,4], [9,3], [8,2], [8,8], [8,6], [5,8], [7,7], [4,7], [3,8], [2,9], [6,9]]. When K = 3 and the centroids are initialized using a random seed of 67, we get the initial centroids [[1,3], [7,7], [0,1]]. The final cluster assignments were idx = [2 0 0 1 1 1 1 1 1 1 1 1 1 1], and the three final centroids were [1, 2.5] (cluster 0), [6.18181818, 6.45454545] (cluster 1), and [0, 1] (cluster 2). (My printed output listed, for each datapoint, the centroid it was assigned to, which is why the centroid values appeared repeated.)
I calculated the squared distance error: 114.86363636363636. Most importantly, I want to understand why there was such a high training error. Could it be that the K-means algorithm is trapped in a suboptimal solution due to this seed (67) and the initial centroids [[1,3], [7,7], [0,1]]?
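As a quick check, the reported error can be recomputed directly from the final assignment; here is a short numpy sketch. The assignment itself also suggests why the cost is high: two of the three centroids ended up sharing the small bottom-left group (points [0,1], [1,2], [1,3]), leaving a single centroid to cover both large groups, which is consistent with being trapped in a local optimum.

```python
import numpy as np

X = np.array([[0,1],[1,2],[1,3],[8,4],[9,3],[8,2],[8,8],
              [8,6],[5,8],[7,7],[4,7],[3,8],[2,9],[6,9]], dtype=float)
idx = np.array([2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

# Recompute each final centroid as the mean of its cluster's points.
centroids = np.array([X[idx == j].mean(axis=0) for j in range(3)])
cost = ((X - centroids[idx]) ** 2).sum()
print(cost)  # ~114.8636, matching the reported error
```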
The sum of squared distances is the sum of the squared distances between each point and that point’s centroid. What proof do you want? You don’t need a proof for the formula of the sum of squared distances, because that is its definition.
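In symbols, writing C_j for the set of points assigned to centroid mu_j, the definition is exactly a finite sum of K terms:

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```

Each of the K terms is the within-cluster sum of squared distances for one cluster, so no approximation is involved.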
Besides, when we talk about an approximation in a maths context, we know it is not the exact form. It’s fine to be unfamiliar with something, but that doesn’t make it an approximation in the maths sense. Let’s use this special term in the same way as others would when discussing maths.