Week #1 Topic: Unsupervised Learning: K-means Clustering: Objective of K-means Optimization

Hello @rmwkwok
I was referring to the context of the “cost function (J)” formula for K-means. I understood the “Optimization Objective” lecture by Prof. Andrew Ng; the formula follows from the definition, and that’s okay.

There is no cost function for K-Means, because none of the examples are labeled.

Hello @suresh359,

The formula itself requires no proof, because it is just the mathematical translation of the English sentence “sum of squared distance between point and the point’s centroid”. It’s pretty much like translating “half price of a $30 book” into “$30 x 0.5”. We don’t need a proof for “$30 x 0.5”, right?

Instead, are you looking for a proof that K-means can converge? This is a very important question, because if you are looking for that, the formula itself wouldn’t lead you there. Let me know.

Cheers,
Raymond

Hello @TMosh

In the lecture by prof. Andrew Ng on “optimization objective”:

The K-means algorithm optimizes a specific cost function. This means that the K-means algorithm tries to minimize the cost function, or distortion, as much as possible, as displayed in the following plots.

I think we can also compute the cost function (J) for K-means, i.e., we can compute

  1. c^(i), the index of the cluster to which example x^(i) has been assigned.
  2. mu_k, the centroid of cluster k.
  3. mu_c^(i), the centroid of the cluster to which example x^(i) has been assigned.
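For reference, the distortion from the lecture can be written as:

```latex
J\left(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K\right)
  = \frac{1}{m}\sum_{i=1}^{m}\left\lVert x^{(i)} - \mu_{c^{(i)}}\right\rVert^2
```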

For example, let’s consider the dataset X = [[0 1] [1 2] [1 3] [8 4] [9 3] [8 2] [8 8] [8 6] [5 8] [7 7] [4 7] [3 8] [2 9] [6 9]], with K = 3 and m = 14 examples.

We can compute the cluster assignments c^(i): idx = [2 0 0 1 1 1 1 1 1 1 1 1 1 1]

mu_c^(i): [[0. 1.] [1. 2.5] [1. 2.5] [6.18181818 6.45454545] [6.18181818 6.45454545] [6.18181818 6.45454545] [6.18181818 6.45454545] [6.18181818 6.45454545] [6.18181818 6.45454545] [6.18181818 6.45454545] [6.18181818 6.45454545] [6.18181818 6.45454545] [6.18181818 6.45454545] [6.18181818 6.45454545]]

and the centroids mu_k:
mu_0 = [1. 2.5]
mu_1 = [6.18181818 6.45454545]
mu_2 = [0. 1.]

Using these values we can compute the cost function (mean squared error) J = 8.204545454545453, or the total squared error (J × m = 8.204545454545453 × 14) = 114.86363636363636.
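The computation above can be reproduced with a short NumPy sketch (the dataset and idx are taken from the post; the variable names are mine):

```python
import numpy as np

# Dataset and cluster assignments from the example above
X = np.array([[0, 1], [1, 2], [1, 3], [8, 4], [9, 3], [8, 2], [8, 8],
              [8, 6], [5, 8], [7, 7], [4, 7], [3, 8], [2, 9], [6, 9]])
idx = np.array([2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
K = 3

# Each centroid mu_k is the mean of the points assigned to cluster k
centroids = np.array([X[idx == k].mean(axis=0) for k in range(K)])

# mu_c^(i): the centroid assigned to each example
mu_c = centroids[idx]

# Total squared error, and the cost function J (mean squared error)
squared_error = np.sum((X - mu_c) ** 2)
J = squared_error / len(X)

print(squared_error)  # ~114.8636
print(J)              # ~8.2045
```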

Hello @rmwkwok
That’s correct. Thank You!

Hello @rmwkwok
K-means will always converge to a local minimum in a relatively short amount of time. However, we can’t be certain whether this local minimum is the global minimum, nor can we determine how far we are from it.

I understand that K-means results are highly dependent on the initialisation of the centroids.

To illustrate this, let’s consider the following dataset:
X = [[0 1] [1 2] [1 3] [8 4] [9 3] [8 2] [8 8] [8 6] [5 8] [7 7] [4 7] [3 8] [2 9] [6 9]], with K = 3 and m = 14 examples.

I tested 3 different cases. All of the examples below converge to a local minimum, but it is not guaranteed that this point is the global minimum.

Case 1: K-means squared-distance error per iteration using seed 67 and an unfortunate random initialization with the centroids [[1 3] [7 7] [0 1]]. The K-means algorithm converged to a sub-optimal solution with the following iterations:

  • Iteration 1: squared-distance error: 126.0
  • Iteration 2: squared-distance error: 114.86363636363636 (K-means decreases the squared-distance error by 11.13636, indicating a negative slope on the error-vs-iteration graph)
  • Iteration 3: squared-distance error: 114.86363636363636 (no further improvement)

Note:

  1. Cluster 3 has only one data point, [0, 1], which overlaps with its centroid; hence its squared distance is zero.
  2. Cluster 1 has two data points, [1, 2] and [1, 3].
  3. Cluster 2 has the remaining data points.


Case 2: K-means squared-distance error per iteration using seed 9 (random initialization with the centroids [[6 9] [8 6] [4 7]]). The K-means algorithm converged to the best solution found, with the following iterations:

  • Iteration 1: squared-distance error: 159.0
  • Iteration 2: squared-distance error: 85.25
  • Iteration 3: squared-distance error: 44.16666666666667
  • Iteration 4: squared-distance error: 44.16666666666667


Case 3: K-means squared-distance error per iteration using seed 67 and k-means++ initialization of the centroids [[8 6] [1 2] [4 7]]. The K-means algorithm converged to an optimal solution with the following iterations:

  • Iteration 1: squared-distance error: 59.0
  • Iteration 2: squared-distance error: 45.46666666666667
  • Iteration 3: squared-distance error: 45.46666666666667
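The per-iteration numbers in Case 1 can be reproduced with a plain K-means loop. This is a sketch under the assumption that the error at each iteration is measured against the current centroids right after the assignment step, which matches the 126.0 reported for iteration 1:

```python
import numpy as np

X = np.array([[0, 1], [1, 2], [1, 3], [8, 4], [9, 3], [8, 2], [8, 8],
              [8, 6], [5, 8], [7, 7], [4, 7], [3, 8], [2, 9], [6, 9]], dtype=float)

def kmeans(X, centroids, n_iters=10):
    """Plain K-means; returns the squared-distance error per iteration."""
    centroids = np.asarray(centroids, dtype=float)
    errors = []
    for _ in range(n_iters):
        # Assignment step: nearest centroid per point (squared Euclidean)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        idx = d2.argmin(axis=1)
        errors.append(d2[np.arange(len(X)), idx].sum())
        # Update step: move each centroid to the mean of its points
        centroids = np.array([X[idx == k].mean(axis=0) if np.any(idx == k)
                              else centroids[k] for k in range(len(centroids))])
        # Stop once the error no longer improves
        if len(errors) > 1 and errors[-1] == errors[-2]:
            break
    return errors

# The "unfortunate" initialization from Case 1
errors = kmeans(X, [[1, 3], [7, 7], [0, 1]])
print(errors)  # starts at 126.0, then gets stuck at ~114.86
```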



Interesting results!

Hello @rmwkwok

I was asking something like this. In the lecture by Prof. Andrew Ng on the “K-means Algorithm”, he explains that “When you implement this algorithm, you find that it’s actually a little bit more convenient to minimize the squared distance because the cluster centroid with the smallest square distance should be the same as the cluster centroid with the smallest distance.”

This concept is related to the idea that the mean μ_k of the set of data points in a cluster k is the unique point that minimizes the sum of squared distances to those data points.

For example, to find this minimum, we can take the gradient of the sum with respect to μ_k and set it to zero. This approach allows us to demonstrate the following.
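As a sketch of that standard argument (with C_k denoting the set of points assigned to cluster k):

```latex
J_k(\mu) = \sum_{i \in C_k} \left\lVert x^{(i)} - \mu \right\rVert^2,
\qquad
\nabla_{\mu} J_k = -2 \sum_{i \in C_k} \left( x^{(i)} - \mu \right) = 0
\;\Longrightarrow\;
\mu = \frac{1}{\lvert C_k \rvert} \sum_{i \in C_k} x^{(i)}
```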

So you are showing that the centroid that minimizes the within-cluster sum of squared distances is computed by taking the mean of all data points in the cluster. It’s good!

I believe you are referring to the “within-cluster sum of squared distances” as shown below? I think this equation is proportional to the variance (Var(X) = σ²). So can we also call this the “within-cluster variance”?

I think there is a problem in this case: we can’t consider cluster 3 a valid cluster because it lacks the at least 2 data points required to define a coherent group. Also, since there is only one data point, the squared distance is zero, and this does not provide a reasonable way to gauge how well the data points are grouped (the quality of the clustering). Therefore, we cannot conclude that it minimizes the within-cluster sum of squared distances.

Yes.

Reasonable, but I have not seen others name it that way. I prefer a name that’s relevant to the context and well-known.

Isn’t zero the minimum?

Hello @rmwkwok

That’s a great question. I think it is a good cluster because its minimum is zero (the data point [0 1] is closer to its centroid than to the other centroids [1.0 2.5] and [6.19 6.46]), which is much smaller than the sum of squared distances of our entire dataset X = [[0 1] [1 2] [1 3] [8 4] [9 3] [8 2] [8 8] [8 6] [5 8] [7 7] [4 7] [3 8] [2 9] [6 9]], measured at 114.87, which I noticed as a local maximum when K = 3. However, we do not have at least 2 points to assess the similarity or dissimilarity.

When we compare this with the optimal solution, where we observed the global minimum at a sum of squared distances of 44.17, the same cluster 3 is formed by the points {(0 1), (1 2), (1 3)}. I believe this makes it much easier to evaluate whether they should be grouped together or not.

The figure below shows a local maximum and the global minimum in the state space.

What do you mean? Similarity to what? How would you assess it with two points? Can you give an example?

What is the x-axis?

Hello @rmwkwok

I used the intuition below: the x-axis indicates the initial centroids, because the results of the K-means algorithm depend heavily on the initialization of the centroids.

For example, if we consider the dataset X = [[0 1] [1 2] [1 3] [8 4] [9 3] [8 2] [8 8] [8 6] [5 8] [7 7] [4 7] [3 8] [2 9] [6 9]], it contains n data points (n = 14). If we want to choose k (k = 3) centroids at a time, we can calculate the number of permutations as follows:
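That count, assuming ordered choices of 3 distinct data points as initial centroids, i.e. P(14, 3), can be checked quickly:

```python
from math import perm

n, k = 14, 3
n_permutations = perm(n, k)  # 14 * 13 * 12
print(n_permutations)  # 2184
```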

I have iterated the K-means algorithm through these 2184 centroid choices; however, I have plotted only the 4 solutions below.

  1. [[8 2] [8 8] [2 9]] : 117.75
  2. [[8 2] [8 8] [6 9]] : 45.47
  3. [[8 8] [0 1] [1 2]] : 114.87
  4. [[0 1] [1 2] [1 3]] : 44.17

For example, let us assume that we are assessing cluster 3 of the optimal solution, where we measured the quality of the clustering through the sum of squared distances of 44.17. If we consider the 2 data points X = [1, 2] and Y = [1, 3], an interesting question arises: how can we measure the similarity between these 2 points? One useful way to do this is cosine similarity, which is the dot product of the L2-normalized points. A higher cosine similarity means the points are more similar.
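For the two points mentioned, the cosine-similarity computation can be sketched as:

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([1.0, 3.0])

# Cosine similarity: dot product divided by the product of the L2 norms
cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_sim)  # ~0.98995: the two points are nearly parallel as vectors
```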

:thinking:

You use squared distance to group them in K-means, but you measure their similarity by their angular difference (which is what cosine similarity can measure).

What’s the purpose of this similarity measurement? To assess K-means result? Is that a fair assessment, given that squared distance and cosine similarity are two different measurements?

I see. So I guess you wanted to deliver the idea that some initial centroids result in a higher squared distance and some lower. I agree with this.

However, whether you see a “peak” (a local maximum) and a valley depends on how you arrange the ordering of those 4 solutions on the x-axis. For example, if we instead ordered them as below, the curve would look monotonically decreasing. I think the ordering is pretty subjective and thus not conclusive.

  1. [[8 2] [8 8] [2 9]] : 117.75
  2. [[8 8] [0 1] [1 2]] : 114.87
  3. [[8 2] [8 8] [6 9]] : 45.47
  4. [[0 1] [1 2] [1 3]] : 44.17

Hello @rmwkwok
I am evaluating whether the data points within cluster 3 show high intra-cluster similarity, i.e., whether points within a cluster are similar to each other. For example, the sample points X = [1, 2] and Y = [1, 3] lack context. If we calculate the Euclidean distance between these points, we find that d(X, Y) = 1. Since d(X, Y) > 0, we can conclude that X ≠ Y.

However, without context, is it acceptable to depend only on the Euclidean distance as a measure of similarity?

Therefore, I thought of inner products, or dot products, defined as ⟨x, y⟩ = ∥x∥∥y∥ cos(α), which indicate the similarity between data points. I have also noticed a connection between the Euclidean distance and the dot product.

The following formula shows that the Euclidean distance between points x and y decreases whenever the dot product x · y increases, suggesting that the dot product itself serves as a measure of similarity.
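The connection being referred to is presumably the standard expansion:

```latex
\lVert x - y \rVert^2 = \lVert x \rVert^2 - 2\,\langle x, y \rangle + \lVert y \rVert^2
```

so for fixed norms, a larger dot product means a smaller Euclidean distance.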

There are ways to measure how well clustering is done, so you may as well use them to evaluate K-means results. For example, you may check out the Silhouette score.
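As a sketch with scikit-learn, using the best 3-cluster partition found earlier for this dataset (the labels below are my reconstruction of that ~44.17 solution):

```python
import numpy as np
from sklearn.metrics import silhouette_score

X = np.array([[0, 1], [1, 2], [1, 3], [8, 4], [9, 3], [8, 2], [8, 8],
              [8, 6], [5, 8], [7, 7], [4, 7], [3, 8], [2, 9], [6, 9]])

# Labels of the best solution found above (sum of squared distances ~44.17)
labels = np.array([0, 0, 0, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2])

# Silhouette score: near +1 = well separated, near 0 = overlapping,
# negative = points likely assigned to the wrong cluster
score = silhouette_score(X, labels)
print(score)  # a value in (-1, 1]; higher means better-separated clusters
```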