Week #1 Topic: Unsupervised Learning: K-means Clustering: Objective of K-means Optimization

Hello @rmwkwok

For example, if we calculate the average silhouette score for the sample dataset X = [[0,1], [1,2], [1,3], [8,4], [9,3], [8,2], [8,8], [8,6], [5,8], [7,7], [4,7], [3,8], [2,9], [6,9]] when K = 3:

The average silhouette score: 0.54

The silhouette analysis for K-means clustering, when K = 3:

The sample silhouette scores are as follows:

The analysis of the silhouette plot shows that the cluster blocks are mostly sharp and peaked, resembling a knife rather than a cigar shape. However, these blocks do not have the uniform width shown in the silhouette plot on the Wikipedia page (Silhouette (clustering) - Wikipedia). Is that because the average silhouette score of the sample dataset is 0.54, which is only just above 0.5?
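For reference, here is a minimal sketch of how that 0.54 average could be reproduced with sklearn (the random seed and `n_init` are illustrative choices, assuming K-means finds the natural three-group partition):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[0, 1], [1, 2], [1, 3], [8, 4], [9, 3], [8, 2], [8, 8],
              [8, 6], [5, 8], [7, 7], [4, 7], [3, 8], [2, 9], [6, 9]])

# Fit K-means with K = 3; n_init=10 restarts help avoid poor local optima
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Mean silhouette coefficient over all 14 samples
print(round(silhouette_score(X, km.labels_), 2))  # ~0.54
```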


I’d say it’s because the dataset is tiny (only 14 examples).

Hello @TMosh

Okay. Thank You!

Nice work, and I agree with Tom.

If I were you, I would always focus on the score of each individual data point, since that is the purpose of the silhouette score.

However, if I also wanted to understand the shape (knife-like vs. the wiki's), I would experiment. For example, I could hand-move the data points so that they disperse less around their respective true centers, and see what the shape looks like. Alternatively, I might use sklearn's make_blobs to generate datasets ranging from very packed (small cluster_std) clusters to very dispersed (large cluster_std) clusters and see how the shape changes. You might also vary n_samples and see how the number of samples affects the shape.

Having said that, I would always be careful not to build up "stereotype" interpretations of these shapes, because, on the one hand, these interpretations might be difficult to verify with datasets of many dimensions (you are studying 2D data here), and on the other hand, again, the silhouette score is about each individual data point.
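To make that experiment concrete, one possible sketch (all parameter values here are illustrative) generates a tightly packed and a dispersed dataset with make_blobs and compares their average silhouette scores:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def avg_silhouette(cluster_std):
    # Three randomly placed centers; only the spread around them changes
    X, _ = make_blobs(n_samples=300, centers=3, cluster_std=cluster_std,
                      random_state=0)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    return silhouette_score(X, labels)

print(avg_silhouette(0.5))  # packed clusters -> higher score
print(avg_silhouette(3.0))  # dispersed clusters -> lower score
```

Plotting the per-sample silhouettes for both cases (e.g. with silhouette_samples) shows how the block shapes change with dispersion.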

Lastly, if I were you, I would find and study other clustering metrics on Google or in some literature reviews.

Good luck!
Raymond

Hello, That’s great!


Hello @rmwkwok

Thank you for highlighting a valuable point. I think the example is a good representation of a two-dimensional dataset, despite having only 14 sample points. It actually produces very interesting results, specifically when K = 3 and seed = 9, using the input X = [[0,1], [1,2], [1,3], [8,4], [9,3], [8,2], [8,8], [8,6], [5,8], [7,7], [4,7], [3,8], [2,9], [6,9]]. In this scenario, the K-means algorithm converged to the optimal solution, resulting in a training error of 44.17.


We can also find very good insights regarding the sample silhouettes shown below in descending order:

  • 50% of the samples have a silhouette score between 0.5 and 0.75.
  • 29% of the samples have a silhouette score below 0.5.
  • 21% of the samples have a silhouette score above 0.75.

Cluster 0:

  • [5, 8] - 0.593 (s(.) > 0.5)
  • [3, 8] - 0.557 (s(.) > 0.5)
  • [6, 9] - 0.539 (s(.) > 0.5)
  • [4, 7] - 0.525 (s(.) > 0.5)
  • [2, 9] - 0.466 (s(.) > 0.25)
  • [8, 8] - 0.148 (s(.) < 0.25)
  • [7, 7] - 0.133 (s(.) < 0.25)

Cluster 0 has a peak silhouette score of approximately 0.593, while its average silhouette width is 0.423, which is considered "weak" since it falls between 0.25 and 0.5; this is why cluster 0 is quite stretched. The corresponding silhouette is sharp, meaning that the per-sample scores vary widely and that this cluster has more assignments than the others.

Cluster 2:

  • [1, 2] - 0.838
  • [0, 1] - 0.793
  • [1, 3] - 0.757

This cluster's silhouette resembles a rectangular shape. It has a peak silhouette score of around 0.838, with an average silhouette width greater than 0.75, which is considered "strong". This means the dataset contains at least one group of highly cohesive samples, namely [[0, 1], [1, 2], [1, 3]], whose members are close to one another relative to points assigned to other clusters. There is also a clear separation between adjacent clusters.

Cluster 1:

  • [9, 3] - 0.697 (s(.) > 0.5)
  • [8, 4] - 0.657 (s(.) > 0.5)
  • [8, 2] - 0.644 (s(.) > 0.5)
  • [8, 6] - 0.203 (s(.) < 0.25)

The average silhouette width for this cluster is 0.56, which exceeds 0.5 and is considered "reasonable". It has a peak silhouette score of approximately 0.697. However, there is one outlier: [8, 6], at 0.203, which is below 0.25.
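A per-cluster listing like the one above can be produced with sklearn's silhouette_samples, which returns the score of each individual point (a sketch; the seed is an illustrative choice):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

X = np.array([[0, 1], [1, 2], [1, 3], [8, 4], [9, 3], [8, 2], [8, 8],
              [8, 6], [5, 8], [7, 7], [4, 7], [3, 8], [2, 9], [6, 9]])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
scores = silhouette_samples(X, km.labels_)

# Print each cluster's members sorted by score, highest first
for k in range(3):
    members = [(float(s), [int(v) for v in p])
               for s, p, l in zip(scores, X, km.labels_) if l == k]
    for s, p in sorted(members, reverse=True):
        print(f"cluster {k}: {p} - {s:.3f}")
```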


Hello @TMosh @rmwkwok

I don’t think the small size of the dataset is an important concern in this context.
For example, consider the dataset X = [[0,0], [2,0], [10,0], [12,0]], which has only 4 samples. This dataset produces excellent results when K = 2: the K-means algorithm converged to an optimal solution, resulting in a cost function (J) of 1.0. In fact, 100% of the samples have a silhouette score exceeding the 0.75 threshold, which demonstrates a high level of cohesion ("strong").

We can draw useful insights from the sample silhouettes shown below in descending order:

Cluster 0:

  • [0, 0] - 0.81 > 0.75
  • [2, 0] - 0.78 > 0.75

Cluster 1:

  • [12, 0] - 0.81 > 0.75
  • [10, 0] - 0.78 > 0.75

We can see that both clusters' silhouettes resemble a rectangular shape and share the same peak silhouette score of around 0.81. Every individual sample silhouette score is greater than 0.75. This confirms that the dataset consists entirely of highly cohesive groups, specifically the clusters {[0, 0], [2, 0]} and {[10, 0], [12, 0]}.
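These numbers are small enough to check by hand as well as with a short script (a sketch assuming sklearn):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

X = np.array([[0, 0], [2, 0], [10, 0], [12, 0]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
s = silhouette_samples(X, labels)

# By hand, e.g. for [0,0]: a = 2 (distance to [2,0]), b = (10 + 12) / 2 = 11,
# so s = (b - a) / max(a, b) = 9 / 11 = 0.818
print(np.round(s, 3))  # [0.818 0.778 0.778 0.818]
```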


Hello @rmwkwok

Using the following input example X = [[-2,1], [-2,3], [0,1], [0,3], [1,-2], [2,-3], [2,-1], [3,-2]], I determined the optimal value of K for k-means clustering, practicing both the elbow method and silhouette analysis. The silhouette analysis showed that K = 2 is the optimal choice among the K values 2, 3, 4, 5, 6, and 7. The average silhouette score for K = 2 is 0.6106, which is greater than 0.5. In contrast, for K values of 3, 4, 5, 6, and 7, the average silhouette scores are all below 0.5.

Applying the elbow method, the cost function (J) for K = 2 is 1.5, which is greater than the J values for K = 3, 4, 5, 6, and 7.

Applying silhouette analysis, the average silhouette scores for each K value are as follows:

  • For K = 2: Average silhouette score = 0.6106 (> 0.5)
  • For K = 3: Average silhouette score = 0.3196 (< 0.5)
  • For K = 4: Average silhouette score = 0.0795 (< 0.25)
  • For K = 5: Average silhouette score = 0.0858 (< 0.25)
  • For K = 6: Average silhouette score = 0.0858 (< 0.25)
  • For K = 7: Average silhouette score = 0.0 (< 0.25)
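A sketch of this sweep (assuming sklearn; J here is the mean squared distance, i.e. inertia divided by the number of samples, matching the course's cost definition):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[-2, 1], [-2, 3], [0, 1], [0, 3],
              [1, -2], [2, -3], [2, -1], [3, -2]])

results = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    j = km.inertia_ / len(X)                 # course-style cost J
    s = silhouette_score(X, km.labels_)      # needs 2 <= k <= n_samples - 1
    results[k] = (j, s)
    print(f"K={k}: J={j:.3f}, avg silhouette={s:.4f}")

# Pick the K with the highest average silhouette
best_k = max(results, key=lambda k: results[k][1])
print(best_k)  # 2
```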

Given these results, is it reasonable to select K = 2 as the optimal number of clusters based on silhouette analysis alone? And why is the elbow method considered so imprecise for choosing the optimal K value?

Since the result of the silhouette analysis is your reason, it is reasonable :wink:

If there were a single best way to choose K, we would have learned it in any book on clustering. Instead, the fact is that there is no single best way, so it is important that we know how each method works. I would recommend that you look into that as well as you try different methods. For example, silhouette analysis takes into account the relations between a data point and each centroid; in contrast, the elbow method does not. The elbow method simply compares whatever "cost value" it is given. Now, this is a very simple qualitative discussion of these methods, but I suggest you go into the details and read some literature reviews on clustering metrics. Try to find the answer yourself, come up with some hypotheses of your own, then test your hypotheses.

Cheers,
Raymond

Hello @rmwkwok
Thank you for your excellent suggestions.
In addition to Prof. Andrew Ng’s “Clustering” sessions, I have also referenced several resources to understand various evaluation metrics, including the following.

Reference Literature:

  1. Flach, P. Machine Learning: The Art and Science of Algorithms that Make Sense of Data.
  2. Bonaccorso, G. Hands-On Unsupervised Learning with Python.
  3. Bonaccorso, G. Mastering Machine Learning Algorithms.
  4. Danka, T. Mathematics of Machine Learning.
  5. Bonaccorso, G., Fandango, A., & Shanmugamani, R. Python: Advanced Guide to Artificial Intelligence.
  6. Everitt, B. S., Landau, S., Leese, M., & Stahl, D. Cluster Analysis, 5th Edition: Optimization Clustering Techniques.

My understanding of evaluation metrics for K-means clustering is that when ground-truth labels are not available, as in unsupervised learning, we can also consider other metrics such as the Calinski-Harabasz index and the Davies-Bouldin index, in addition to the silhouette score and the elbow method.
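Both of those indices are available in sklearn; a minimal sketch on the 14-point dataset from earlier in this thread (higher Calinski-Harabasz is better, lower Davies-Bouldin is better):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

X = np.array([[0, 1], [1, 2], [1, 3], [8, 4], [9, 3], [8, 2], [8, 8],
              [8, 6], [5, 8], [7, 7], [4, 7], [3, 8], [2, 9], [6, 9]])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(calinski_harabasz_score(X, labels))  # higher = denser, better separated
print(davies_bouldin_score(X, labels))     # lower = better separation
```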

However, consider the following example:
Training dataset: {[8,0], [44,0], [50,0], [58,0], [84,0]}
Random state: 200
Inertia (sum of squared distances of samples to their closest cluster center) = 1280.0; cluster centers: [[64, 0], [26, 0]]
Labels: [1 1 0 0 0]
The average silhouette score is only 0.1979.
The individual sample silhouette values are 0.357, -0.444, 0.125, 0.468, and 0.482.

In this example, the individual silhouette value for the point [44,0] is -0.444. This negative value indicates that the sample has been assigned to the wrong cluster (cluster 1), because a([44,0]) > b([44,0]), which makes the difference b([44,0]) - a([44,0]) negative.

Based on this evidence, we can conclude that the clustering quality is poor. In this scenario, could the insights from the silhouette measure be helpful in reducing the sum of squared distances (inertia) to produce a more optimal clustering solution?
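The per-point values quoted above can be reproduced directly from the given labels, without rerunning K-means (a sketch assuming sklearn):

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.array([[8, 0], [44, 0], [50, 0], [58, 0], [84, 0]])
labels = np.array([1, 1, 0, 0, 0])  # the assignment found with random_state=200

s = silhouette_samples(X, labels)
print(np.round(s, 3))                         # ~[0.357 -0.444 0.125 0.469 0.483]
print(round(float(silhouette_score(X, labels)), 3))  # 0.198

# A negative score flags a likely misassignment
print(X[s < 0])  # [[44  0]]
```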

In this example, we already know that there is an optimal solution based on the following plot, which shows the association between inertia (sum of squared distances of samples to their closest cluster centroids) and n_init (the number of times the k-means algorithm is run with different centroid seeds).

Optimal Solution:

Cluster Centers: [[59. 0.] [ 8. 0.]]
Inertia = 932.0
Centroid Seed: 2
Labels: [1 0 0 0 0]

That’s up to you. What will you do given these insights?

Hello @rmwkwok
Thank you! Based on the insights obtained, I believe we can take the following steps.

We can compute the path from the given centroids [[64, 0], [26, 0]] to the optimal cluster centers (the goal state) [[59, 0], [8, 0]] with labels [1 0 0 0 0], which yields a sum of squared distances of 932.0.

It is also clear that we can move centroid 0 to the left along the path from [64,0] to [59,0] (a distance of 5), and move centroid 1 to the left along the path from [26,0] to [8,0] (a distance of 18).

After these centroid adjustments, we can recalculate the cluster labels to obtain the optimal clustering result, namely labels [1 0 0 0 0] and cluster groups {[8]} and {[44, 50, 58, 84]}.

We can then recompute the metrics to verify that the sum of squared distances is 932.0, and check the silhouette scores for an improvement in clustering quality: the average silhouette score is 0.460, and the sample silhouette values are [0, 0.444, 0.619, 0.680, 0.561].
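These post-move numbers can be verified directly from the optimal labels (a sketch assuming sklearn; note that sklearn assigns a silhouette of 0 to the singleton cluster {[8,0]}):

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.array([[8, 0], [44, 0], [50, 0], [58, 0], [84, 0]])
labels = np.array([1, 0, 0, 0, 0])   # the optimal assignment
centers = np.array([[59, 0], [8, 0]])

# Inertia: squared distance from each point to its assigned center
inertia = int(((X - centers[labels]) ** 2).sum())
print(inertia)  # 932

print(np.round(silhouette_samples(X, labels), 3))  # ~[0. 0.444 0.619 0.68 0.561]
print(round(float(silhouette_score(X, labels)), 3))  # 0.461
```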

I think that, treating the optimal assignments as ground truth for the training set {[8,0], [44,0], [50,0], [58,0], [84,0]}, it is possible to compute further metrics for the predicted labels [1 1 0 0 0], such as the completeness score, homogeneity score, V-measure, and others.

For example, if we calculate the completeness score, we know that the data point [44,0] has been wrongly assigned to cluster 1; hence, we would expect a non-perfect score of about 0.3316.
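A quick check of that expectation (a sketch assuming sklearn, with the optimal assignment [1, 0, 0, 0, 0] taken as the reference labels):

```python
from sklearn.metrics import (completeness_score, homogeneity_score,
                             v_measure_score)

labels_true = [1, 0, 0, 0, 0]  # optimal assignment, used as reference
labels_pred = [1, 1, 0, 0, 0]  # assignment that misplaces [44, 0]

print(completeness_score(labels_true, labels_pred))  # ~0.3316
print(homogeneity_score(labels_true, labels_pred))
print(v_measure_score(labels_true, labels_pred))
```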

But even if there is an improvement, it is not always feasible for you to observe where to hand move the centroids, right? Because you may deal with high dimensional data that you can’t visualize.

@suresh359, I just realized that we have been talking in this topic for a month and I hope you had fun in your self-learning. I would just like to tell you that it’s likely I am going to be away since the end of this month for some time, so let’s aim to close this topic by then.

So far, I have seen that you have explored a lot and have been guiding yourself, which is great. But I also observed that your experiments were mostly on small, 2-dimensional datasets. These datasets are wonderful as starters, but they have limitations; for example, as I mentioned in my last reply, some actions may not be feasible with higher-dimensional data. I think we need both: simple datasets for basic understanding, and complex ones for more practical hands-on experience, like addressing problems you didn't have with simple datasets and facing the limitations that complex datasets bring. We might want to switch between them from time to time: for example, work on the complex ones, get or brainstorm new ideas, test them with both the simple and the complex ones, then go back to the complex ones.

Cheers,
Raymond

Hello @rmwkwok

Thank You!
I agree with you. This example involves a low-dimensional, small dataset, so the silhouette distance calculations are less complex. Hence, it was easy to verify that the sample data point [44, 0] has been assigned to the wrong cluster, because a([44,0]) = 36, which is greater than b([44,0]) = 20. To handle the imbalance, we can move the centroids to the left.

First, I don’t know whether the sum of squared distances goes down and the silhouette score goes up after the move. Second, you can’t always hand-move a point, especially when dealing with high-dimensional data. Therefore, I would not suggest the move.

If you had tried it on a complex dataset with many dimensions, you would have found this yourself. Being able to find problems yourself is very important for self-learning. I really suggest you try a complex dataset and test the limitations of your approach.

Thank you. I agree with you that complex datasets with many features can be hard. However, I think simple datasets also help in analyzing a variety of factors, such as proximity, similarity, continuity, familiar configuration, etc., to determine which segments of a dataset naturally group together. For example, consider the following problem: X = np.array([[0,3], [1,2], [1,4], [4.5,5], [6,6], [6,4], [6,2], [6,0], [4.5,1]]). In this case, K-means identifies an optimal solution with 3 clusters and an inertia (sum of squared distances) of 8.0.

We can learn different things from simple and complex datasets.

I am surprised that the topic of K-Means has generated such a lengthy conversation.
This is a very deep exploration of the topic. It is impressive.