Need help with Kaggle dataset. Please help!

I have only completed the MLS Specialization and I am trying to solve problems on Kaggle. I am trying to use the K-Means algorithm on a Spotify dataset, but when I calculated the summed squared error with 10 or 20 clusters, the cost was very large. I think the problem is in my preprocessing, but I don’t know what to do. Why did this happen?

But let’s say that my costs were decreasing and I found the optimal number of clusters through the Elbow Method. How would I visualize that in matplotlib? I have done it by putting 2 features from the original dataset on the x and y axes and setting hue = the prediction from the K-Means algorithm. What would be other ways to do it?
Also, my plot doesn’t make much sense, does it?

Dataset :

My notebook:
k-means-clustering.py (2.7 KB)

Hi @Arisha_Prasain

If you wish to visualize the optimal number of clusters that you found through the elbow method, you could try the following:

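from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# X is your feature matrix; replace <max_clusters> with the upper bound of k to try
visualizer = KElbowVisualizer(KMeans(), k=(2, <max_clusters>))
visualizer.fit(X)    # fits KMeans once per k and records the score for each
visualizer.show()    # draws the elbow plot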

The above code will show you an elbow plot (image not shown).
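If you would rather stay with plain matplotlib, roughly the same curve can be drawn by hand from the `inertia_` attribute of a fitted `KMeans` model. This is only a minimal sketch: the data here is synthetic (generated with `make_blobs`) and stands in for your real feature matrix, and the range of k values is an arbitrary choice.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for the real feature matrix X (assumption).
X, _ = make_blobs(n_samples=400, centers=4, random_state=0)

# Fit one model per candidate k and keep its cost.
# inertia_ is the sum of squared distances from each point to its closest centroid.
ks = list(range(2, 11))
costs = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in ks
]

# Plot cost against k; the "elbow" is where the curve stops dropping sharply.
fig, ax = plt.subplots()
ax.plot(ks, costs, marker="o")
ax.set_xlabel("number of clusters k")
ax.set_ylabel("inertia (summed squared distance)")
ax.set_title("Elbow plot")
```

Call `plt.show()` (or `fig.savefig(...)`) afterwards to display or save the figure.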

Hope this helps.

Cheers,
Abhishek


I don’t know what the data looks like, but for KMeans it is important that you scale the data first. It makes sense if you think about it intuitively: KMeans measures the Euclidean distance between the data points and the centroids, so if your features are not on the same scale, the features with the largest numeric ranges dominate the distance and the others barely count. The cost is a sum of squared distances, so a large-valued feature also makes the cost itself huge. Scaling with StandardScaler (or a similar tool) gives every feature zero mean and unit variance, which puts them all on a comparable footing.
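To make the scaling point concrete, here is a minimal sketch on made-up data. The column names `duration_ms` and `danceability` are hypothetical stand-ins and are not taken from the actual dataset; the point is only to compare the K-Means cost (`inertia_`) on raw versus standardized features.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): one feature in the hundreds of
# thousands, one feature between 0 and 1.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "duration_ms": rng.normal(200_000, 50_000, size=500),
    "danceability": rng.uniform(0.0, 1.0, size=500),
})

def kmeans_cost(X, k=10):
    """Fit K-Means and return inertia_, the sum of squared distances
    from each sample to its closest centroid."""
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return model.inertia_

raw_cost = kmeans_cost(df.to_numpy())
scaled_cost = kmeans_cost(StandardScaler().fit_transform(df))

print(f"cost on raw features:    {raw_cost:.3e}")
print(f"cost on scaled features: {scaled_cost:.3e}")
```

On the raw data the cost is driven almost entirely by the large-valued column, which is why it looks enormous. After standardizing, the total squared distance to the overall mean is exactly (number of rows) x (number of features), so the cost can never exceed that.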

For your plotting question, you definitely have the right idea by using hue to differentiate the clusters. Basically, all you have to do is add the KMeans predictions to your dataframe as a new column; then you can plot your data however you want and simply pass the predictions column as hue/size/shape/whatever.
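As a rough sketch of that idea in plain matplotlib: the dataframe below is synthetic, and the column names (`energy`, `loudness`, `tempo`), the choice of 4 clusters, and the helper `plot_clusters` are all hypothetical, chosen only for illustration. The labels are stored in a new `cluster` column and each cluster is drawn in its own colour, which is the matplotlib equivalent of passing that column as hue.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real dataframe (column names are made up).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "energy": rng.uniform(0.0, 1.0, size=300),
    "loudness": rng.normal(-8.0, 3.0, size=300),
    "tempo": rng.normal(120.0, 25.0, size=300),
})

# Cluster on the scaled features, then attach the labels as a new column.
X = StandardScaler().fit_transform(df)
df["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

def plot_clusters(data, x, y, label_col="cluster"):
    """Scatter two columns of `data`, one colour per cluster label."""
    fig, ax = plt.subplots()
    for label, group in data.groupby(label_col):
        ax.scatter(group[x], group[y], s=12, label=f"cluster {label}")
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.legend()
    return fig

fig = plot_clusters(df, "energy", "loudness")
```

Call `plt.show()` to display the figure. Since the clusters are found using all the features, a plot of just two of them can look mixed up even when the clustering is reasonable, so it can be worth trying a few different pairs of columns.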
