Need help with kaggle dataset. Please help!

Arisha_Prasain · November 27, 2022, 1:30pm

I have only completed the MLS Specialization and am now trying to solve problems on Kaggle. I am trying to use the K-Means algorithm on a Spotify dataset, but when I calculated the summed squared error with 10 or 20 clusters, the cost was very large. I think the problem is in my preprocessing, but I don’t know what to fix. Why did this happen?

But let’s say my costs were decreasing and I had found the optimal number of clusters through the Elbow Method: how would I visualize that in matplotlib? I have done it by keeping two features from the original dataset on the x and y axes and setting hue to the predictions from the K-Means algorithm. What other ways are there to do it?
Also, my plot doesn’t make much sense, does it?

Dataset :

My notebook:
k-means-clustering.py (2.7 KB)

the_sophic · January 1, 2023, 5:20pm

Hi @Arisha_Prasain

If you wish to visualize the optimal number of clusters that you found through the elbow method, you could try the following:

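from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# Replace <max_clusters> with the largest k you want to try; X is your feature matrix.
visualizer = KElbowVisualizer(KMeans(), k=(2, <max_clusters>))
visualizer.fit(X)    # fits one KMeans model per value of k
visualizer.show()    # draws the elbow plot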

The above code plots the distortion score against the number of clusters and marks the elbow it detects.

Hope this helps.

Cheers,
Abhishek

graham_broughton · January 15, 2023, 12:31pm

I don’t know what the data looks like, but for KMeans it is important that you scale the data first. It makes sense if you think about it intuitively… KMeans assigns points by Euclidean distance, so if your features are not on the same scale, the ones with the largest ranges dominate the distance between the data points and the centroids, and the summed squared error ends up huge in absolute terms. Scaling with StandardScaler or a similar tool puts every feature on a comparable scale (zero mean and unit variance, in StandardScaler’s case).
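To make the scaling point concrete, here is a minimal sketch on synthetic data. The two feature names (`duration_ms`, `danceability`) and their ranges are assumptions chosen for illustration, not columns taken from your notebook, and `sse` is just a small helper I made up for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two features on very different scales
# (hypothetical names and ranges, not the real Spotify columns).
rng = np.random.default_rng(0)
duration_ms = rng.normal(200_000, 50_000, size=500)  # hundreds of thousands
danceability = rng.uniform(0.0, 1.0, size=500)       # between 0 and 1
X = np.column_stack([duration_ms, danceability])


def sse(data, k=10):
    """Fit K-Means with k clusters and return the summed squared error."""
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    return model.inertia_  # sum of squared distances to the nearest centroid


raw_sse = sse(X)

# Standardize each column to zero mean and unit variance, then refit.
X_scaled = StandardScaler().fit_transform(X)
scaled_sse = sse(X_scaled)

print(f"SSE on raw features:    {raw_sse:.3e}")
print(f"SSE on scaled features: {scaled_sse:.3e}")
```

On the raw data the cost is driven almost entirely by the large-valued column, so the 0 to 1 feature barely affects the clustering. The two numbers are in different units, so compare the shape of the elbow curve before and after scaling rather than the raw values.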

For your plotting question, you definitely have the right idea by using hue to differentiate the clusters. Basically all you have to do is join the KMeans predictions with your dataframe so you can plot your data however you want and simply pass the predictions column in hue/size/shape/whatever.
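A rough sketch of that join-then-plot idea, assuming toy data from `make_blobs` and made-up column names (`energy`, `loudness`). It uses plain matplotlib with `c=` for the colour; passing the same column as `hue` to a seaborn scatter plot should give an equivalent picture.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the real dataset; column names are hypothetical.
features, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=0)
df = pd.DataFrame(features, columns=["energy", "loudness"])

# Cluster on the scaled features, then store the labels as a new column
# so they travel with the original (unscaled) rows.
X_scaled = StandardScaler().fit_transform(df[["energy", "loudness"]])
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(X_scaled)

# Plot two original features and colour each point by its cluster label.
fig, ax = plt.subplots()
points = ax.scatter(df["energy"], df["loudness"], c=df["cluster"], cmap="tab10", s=15)
ax.set_xlabel("energy")
ax.set_ylabel("loudness")
ax.legend(*points.legend_elements(), title="cluster")
fig.savefig("clusters.png")  # or plt.show() in a notebook
```

Because the labels live in the dataframe, you can swap in any other pair of columns for the axes, or group by `cluster` to compare feature averages per cluster.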
