Week #1 Topic: Unsupervised Learning: K-means Clustering: Objective of K-means Optimization

  • Week # 1.
  • Link to the classroom item you are referring to: K-means Clustering - Objective of K-means Optimization
  • Description: When I chose K = 3 and max_iters = 20, I observed that the K-means algorithm consistently decreases the value of the distortion with each iteration step, but it eventually gets stuck at J = 0.8889 and does not reduce further. In fact, even after different random initializations, it gives me two distinct K-means solutions, but in both cases it is stuck at the same value of J = 0.8889. Why did it end up with the same value in both scenarios and fail to reduce further?
  • Random Initialization-1:

  • Random Initialization-2:

Hello @suresh359,

Was it from the assignment? I am asking because I think the assignment didn't compute the "cost", and it didn't draw the cost curve either, so I think you added them yourself? If so, I need to know how you computed the cost. Was it the sum of squared distances between the points and their centroids? If so, I think there is always a minimum cost that you can't reduce further, because, for example, if I wanted that sum of squared distances to be zero, then I would need every distance to be zero, which means every point would have to be at the same location as its centroid, and that is not possible when I have so many points but only 3 centroids.
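In case it helps to make the discussion concrete, here is a minimal sketch of that computation (my own code, not the assignment's; it assumes X is an (m, n) array of points, centroids a (K, n) array, and idx an (m,) array of assigned centroid indices):

    import numpy as np

    def compute_distortion(X, centroids, idx):
        # Mean of squared distances between each point and its assigned centroid.
        diffs = X - centroids[idx]                  # (m, n) point-to-centroid differences
        return np.mean(np.sum(diffs ** 2, axis=1))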

Therefore, for each value of k, there is such a minimum cost that it can never do better than, but the question is: what is that minimum cost? If you know the true labels, then obviously you can calculate it. Otherwise, I would try many different initializations like you did, but more of them, and see what the minimum is.
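A rough sketch of that multi-initialization idea (run_kmeans is a hypothetical helper standing in for the lab's K-means loop, and compute_distortion is as sketched above):

    import numpy as np

    def kmeans_with_restarts(X, K, n_restarts=50, max_iters=20, seed=0):
        # Run K-means from many random initializations and keep the lowest distortion.
        rng = np.random.default_rng(seed)
        best_J, best = np.inf, None
        for _ in range(n_restarts):
            # Pick K distinct data points as the initial centroids.
            init = X[rng.choice(len(X), size=K, replace=False)]
            centroids, idx = run_kmeans(X, init, max_iters)   # hypothetical K-means routine
            J = compute_distortion(X, centroids, idx)
            if J < best_J:
                best_J, best = J, (centroids, idx)
        return best_J, best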

Cheers,
Raymond

If both initializations eventually lead to the same cluster membership, then your “distortion” value will be the same in both cases.

Hello, thanks for your reply. As explained in the Optimization Objective lecture by Professor Andrew Ng, I was trying to understand the K-means algorithm and why it tries to minimize this cost function (J), or why it minimizes the distortion. To further my understanding, I updated the lab assignment code to include the additional cost function as discussed by the professor.

I observed the following: the K-means algorithm consistently decreases the value of the cost function (J), with the lowest value of J representing the lowest cost, but it eventually gets stuck at J = 0.8889 and does not reduce further. However, from my readings in the academic literature, I learned that K-means converges to a local minimum (i.e., zero) but does not guarantee convergence to a global minimum. This is because the cost function is non-convex and contains many local minima, making it an NP-hard problem to find the global minimum.

There really isn’t a cost function for unsupervised learning, because there are no labeled examples.

Perhaps for this dataset, you might consider that maybe the value 0.8889 is the global minimum of the distortion value you’re computing.

Hello, below are my two distinct initializations. Are you saying that the same cluster membership means the following?

For scenario 1, the initial centroids are:
Red Cluster: [1.15354031, 4.67866717]
Green Cluster: [5.50295759, 2.62924634]
Blue Cluster: [2.02358978, 0.44771614]

For scenario 2, the initial centroids are:
Red Cluster: [1.24792268, 4.93267846]
Green Cluster: [5.05274526, 2.75692163]
Blue Cluster: [6.62447253, 2.74453743]

What matters is the values for the final centroids, not the initial centroids.

Note that the cluster designations for “red”, “green” and “blue” don’t matter. They’re arbitrary. This isn’t a labeled dataset.

What matters is which examples are associated with each cluster.
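If it helps, one way to check this in code is to compare the partitions of example indices rather than the colour labels (a sketch, assuming idx1 and idx2 are the final assignment arrays from the two runs):

    import numpy as np

    def same_membership(idx1, idx2):
        # True if the two assignment vectors induce the same grouping of examples,
        # regardless of how the clusters happen to be numbered or coloured.
        groups1 = {frozenset(np.flatnonzero(idx1 == k)) for k in np.unique(idx1)}
        groups2 = {frozenset(np.flatnonzero(idx2 == k)) for k in np.unique(idx2)}
        return groups1 == groups2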

It’s also possible that there may be an error in your code for computing the “distortion”. I have not inspected your code.

Thank you! Thank you! The distortion function, i.e., for each point i = 1 to m with cluster assignments c^{(i)} \in \{1, \dots, K\}, the cost function J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \lVert x^{(i)} - \mu_{c^{(i)}} \rVert^2 (the mean of squared distances), is not applicable to unlabeled examples in unsupervised learning. How can we determine whether the value J = 0.8889 is a global minimum rather than a local minimum of the distortion function for the given training set \{x^{(1)}, \dots, x^{(m)}\}?

In general, there is no mathematical method to prove whether a given minimum is also the global minimum.

Do you have a link for this literature? I would like to read it.

I am not convinced this statement is correct. In general, the sum of the squared errors will be convex, because the 2nd partial derivative of a quadratic sum is always positive or zero. So there are no maxima, and hence no local minima.

Hello, please find the academic reference: Peter Flach, Machine Learning: The Art and Science of Algorithms that Make Sense of Data, page 249, which comments: "In general, while K-means converges to a stationary point in finite time, no guarantees can be given about whether the convergence point is in fact the global minimum, or if not, how far we are from it."

Here is another video reference on the topic:

Statistical Learning: 12.3 K-Means Clustering: https://youtu.be/ded_NQqOe7I?si=sKOc_j3cBQGkoMfA

Thanks for the video link.

It’s an important point, that for a given set of final centroids, the distortion function (the sum of the squares of the differences) is convex.
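A small sketch of why, under the assumption that the cluster assignments c^{(i)} are held fixed so that J is viewed purely as a function of the centroid locations:

    J(\mu_1, \dots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \lVert x^{(i)} - \mu_{c^{(i)}} \rVert^2,
    \qquad
    \frac{\partial J}{\partial \mu_k} = \frac{2}{m} \sum_{i:\, c^{(i)} = k} (\mu_k - x^{(i)}),
    \qquad
    \frac{\partial^2 J}{\partial \mu_k^2} = \frac{2 m_k}{m} \ge 0,

where m_k is the number of points assigned to cluster k. Every second derivative is non-negative, so with the assignments frozen, J is convex in the centroids and is minimized by putting each \mu_k at the mean of its assigned points. The non-convexity only enters once the discrete assignments are allowed to change.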

But different centroid sets may give a different minimum distortion value.

There’s no general resolution to finding the lowest possible distortion. You just have to try lots of different initial centroids, run the algorithm, and compare the different final distortion values.
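For what it's worth, library implementations do exactly this: scikit-learn's KMeans has an n_init parameter that reruns the algorithm from several initializations and keeps the run with the lowest inertia (its name for the un-averaged sum of squared distances). A rough usage sketch, assuming X is your (m, n) data array:

    from sklearn.cluster import KMeans

    km = KMeans(n_clusters=3, init="random", n_init=50, max_iter=20, random_state=0).fit(X)
    print(km.inertia_)             # best (lowest) sum of squared distances found
    print(km.inertia_ / len(X))    # divide by m to compare with the averaged distortion J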

In general practice, we really don’t necessarily care about finding the global minimum - we just have to find a minimum that gives “good enough” results for the problem we’re trying to solve.

I would like to gain a deeper understanding of the following topic. Do you have any sources for the reference literature?

“In general, the sum of the squared errors will be convex, because the 2nd partial derivative of a quadratic sum is always positive or zero. So there are no maxima, and hence no local minima”

“It’s an important point, that for a given set of final centroids, the distortion function (the sum of the squares of the differences) is convex.”

I don’t have any references, but a textbook on differential calculus would cover the use of 2nd partial derivatives. I last studied it several decades ago.

Fine. Thanks!

Hello, @suresh359,

I think there are 2 points that we may want to pay attention to:

1

I read page 249 of Flach's book, Machine Learning: The Art and Science of Algorithms that Make Sense of Data, and it does not give "zero" as the value of any local minimum.

To disprove something, we only need one counter-example. Therefore, to disprove "zero", we can look at the following counter-example, in which there are two data points and k is set to 1.

In this case, the sum of squared distances can never be zero, no matter where we put the centroid. In other words, there is a local minimum, but that local minimum is not always zero.
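To make that counter-example concrete with some made-up numbers: take two one-dimensional points x^{(1)} = 0 and x^{(2)} = 2 and a single centroid \mu. The sum of squared distances is

    (0 - \mu)^2 + (2 - \mu)^2,

which is minimized at \mu = 1, giving 1 + 1 = 2. So the best a single centroid can ever achieve here is 2, not zero.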

2

I have searched the K-means section of the book (pages 247 to 252) for the word "convex" and it never shows up.

I think there is a reason that we might not want to discuss “convex” in K-means.

"Convex" is a geometric idea, so when we say "convex", we need to be clear in our minds about which space we are thinking in.

For example, for a squared error loss L(\hat{y}, y) = (\hat{y} - y)^2, the loss is convex in the space of \hat{y}.

Another example: for linear regression, which is f_{w,b}(x) = wx + b with the loss J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2, the loss is convex in the space of w and b.

You see, the full description should be "the loss is convex in the space of some dimensions".

Now, what is the full description of convexity for k-means? What are the dimensions in which the loss is (or is not) convex?

For linear regression, we choose w and b because they are the trainable parameters. For K-means, what are the trainable parameters? In other words, what are the things that are being trained to change? I am going to write them below as a workflow:
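A minimal sketch of that workflow (my own rendering of the two alternating steps, assuming X is an (m, n) array and init_centroids a (K, n) array):

    import numpy as np

    def run_kmeans(X, init_centroids, max_iters=20):
        # Both the assignments and the centroid locations are the quantities
        # that K-means keeps changing.
        centroids = np.array(init_centroids, dtype=float)
        for _ in range(max_iters):
            # Step 1: assignment -- give each point the ID of its nearest centroid.
            # The cluster ID is a discrete choice, not a continuous dimension.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            idx = np.argmin(dists, axis=1)
            # Step 2: update -- move each centroid to the mean of its assigned points.
            # The centroid locations are the continuous quantities being adjusted.
            for k in range(len(centroids)):
                if np.any(idx == k):
                    centroids[k] = X[idx == k].mean(axis=0)
        return centroids, idx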

Now, I want to emphasize that one of the things the sum of squared distances is computed over, the data point's cluster ID, is not a continuous dimension. If it is not a continuous dimension, we cannot differentiate with respect to it. To prove that the loss is convex along a dimension, we show that the 2nd derivative (with respect to that dimension) is positive at the location where the 1st derivative is zero. However, because it is not differentiable at all here, we cannot prove it this way with differentiation.

In fact, as I said before, to disprove something you only need a counter-example, and the book actually delivers such an example on page 249.

In conclusion, I would like to suggest that, when you study K-means,

  1. to disprove that it is convex, you need only a counter-example;
  2. there is no need to use calculus or the derivative of whatever function you call the loss to see if it is convex. "Convexity, calculus, derivatives" are concepts usually bundled together in many discussions, but for K-means, I don't really think that bundle is relevant.

Cheers,
Raymond

Hello, thanks for the explanation. I found the word "convex" in the following video lectures on the topic:

  1. Statistical Learning: 12.3 K-Means Clustering: https://youtu.be/ded_NQqOe7I?si=sKOc_j3cBQGkoMfA
  2. K-Means, GMM (non-EM), Expectation Maximization I 2022 I Lecture 12: https://www.youtube.com/watch?v=LMZZPneTcP4&t=1176s
  3. Machine Learning 13 - K-Means: https://www.youtube.com/watch?v=5-Fn8R9fH7A
  4. Machine Learning | Lecture 16 - K-Means, GMM, and EM: https://www.youtube.com/watch?v=LmpkKwsyQj4&t=315s

I referenced terms such as "global minimum", "total within-cluster scatter", "stationary points in clustering", "Example 8.5", and "optimal solution" from the following academic source: Peter Flach, Machine Learning: The Art and Science of Algorithms that Make Sense of Data, pages 247-252.

From Example 8.5 (Stationary Points in Clustering):
I found an example task of dividing the set of numbers {8, 44, 50, 58, 84} into two clusters (K = 2). This example helped demonstrate that the K-means algorithm converges to a stationary point in finite time and can obtain an optimal clustering; specifically, it minimises the total within-cluster variation. But there is no guarantee that the convergence point is the global minimum or, if it is not, how far away it is from that minimum.

I ran the K-means algorithm on the example task of dividing the set of numbers {8, 44, 50, 58, 84} into two clusters (K = 2), which resulted in the optimal clustering of {8} and {44, 50, 58, 84}, effectively minimizing the total within-cluster scatter.
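A quick sanity check of that clustering's scatter (my own computation with the book's numbers):

    import numpy as np

    clusters = [np.array([8.0]), np.array([44.0, 50.0, 58.0, 84.0])]
    # Total within-cluster scatter: sum of squared distances to each cluster's mean.
    scatter = sum(((c - c.mean()) ** 2).sum() for c in clusters)
    print(scatter)    # 0 + (225 + 81 + 1 + 625) = 932.0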



Hello @rmwkwok, Regarding Topic 1 (Edited),

In the case when K = 1, if we place the centroid at [1.5, 1.5], the K-means algorithm minimizes the sum of squared distances down to 1.0, but this does not reach the optimal solution of "zero".

In the case when K = 2, i.e., 2-means, and here K = m, where m = 2 samples for the given data [1.0, 1.0], [2.0, 2.0]: if we initialize the centroids to µ1 = [2.0, 2.0], µ2 = [1.0, 1.0] or µ1 = [1.0, 1.0], µ2 = [2.0, 2.0], the sum of squared distances, or inertia, is "zero". Does this make it an optimal solution, verifying that the sum of squared distances or inertia is zero? Below is my understanding for your consideration.
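A small numeric check of both cases (a sketch reusing the same two points):

    import numpy as np

    X = np.array([[1.0, 1.0], [2.0, 2.0]])

    # K = 1: the single centroid ends up at the mean of the two points.
    mu = X.mean(axis=0)                        # [1.5, 1.5]
    print(((X - mu) ** 2).sum())               # 1.0 -> can never reach zero

    # K = 2: one centroid sitting on each data point.
    mus = np.array([[1.0, 1.0], [2.0, 2.0]])
    print(((X - mus) ** 2).sum())              # 0.0 -> zero only because K = m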

Hello @rmwkwok, I have another viewpoint to share: please consider K = 1 and K = 2 as edge test cases, since we have only m = 2 data points: [1.0, 1.0] and [2.0, 2.0].
In this special check, when K = 2, a "zero" for the sum of squared distances is a meaningful result, as we are working with only 2 data points.

If you have the same number of clusters as examples, the solution is trivial and provides no insight.

Hello @suresh359,

Let me summarize some of your observations:

  • When k = 1, k-means never gets us to a total squared distance of 0.
  • When k = 2, it is possible that k-means can get us to 0.

Given these, I will still say that k-means does not always converge the sum of squared distances to zero. The reason is simple: k is not adjustable by k-means. K-means only adjusts the locations of the centroids, and for me to agree that k-means converges to zero, it has to do so by adjusting those locations. Because, when k = 1, k-means cannot move the centroid anywhere that makes the sum zero, I say "k-means does not always converge the sum of squared distances to zero".

k = 2 can reach zero only because we have the same number of centroids as data points; however, we will never use k-means like that. In practice, we use fewer centroids than data points, and as a consequence, it cannot reach zero.

Cheers,
Raymond