Clustering methods for embedding vectors

Hi! I’m working on a project and am very new to Natural Language Processing techniques. I have some high-dimensional embeddings generated from resumes and want to cluster them. I found suggestions to use UMAP and t-SNE, but both come with warnings on their own websites. I’m looking for a reliable method that works with Python. Thanks!

Well, first you need to define “reliable”. :grinning_face: I phrased that as a joke, but it’s a serious question: how do you define success or “good enough”? What is your actual goal?

The topic of vector embeddings (both how to create them and how to use them) is discussed in some detail at several points in the courses offered here. For example, it is one of the major topics in DLS Course 5 Sequence Models. I’m sure it is also covered in the NLP specialization.

There are two standard metrics for comparing embedding vectors: cosine similarity (higher means more similar) and Euclidean distance (lower means more similar). I do not have personal experience with this beyond the material in the courses here. My suggestion would be to start by taking DLS C5 and learning what is taught there. That will at least give you a framework for interpreting the warnings on the two websites that you mention above.
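To make the two metrics concrete, here is a minimal sketch in plain Python. The function names and the toy vectors are made up for illustration; real resume embeddings would have hundreds of dimensions, and in practice you would probably use NumPy or scikit-learn rather than hand-written loops.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between a and b; sensitive to vector length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy vectors (hypothetical, not real embeddings).
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]   # same direction as u, twice the length
w = [-3.0, 0.0, 1.0]  # orthogonal to u

print(cosine_similarity(u, v))   # close to 1.0: same direction
print(cosine_similarity(u, w))   # close to 0.0: orthogonal
print(euclidean_distance(u, v))  # nonzero even though the direction is identical
```

Note that `u` and `v` point in the same direction, so cosine similarity treats them as essentially identical while Euclidean distance does not. Which behavior you want depends on whether the length of an embedding carries meaning for your data.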


Thank you very much for all the suggestions. :wink:

Please check out the TensorFlow Embedding Projector as well. It shows how to project embeddings to a lower-dimensional space, either for visualization or as a preprocessing step for clustering.
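If you want to try the "project first, then cluster" idea in Python, a rough sketch with scikit-learn could look like the one below. Everything specific here is an assumption on my part: the data is synthetic stand-in data (three artificial groups), PCA and k-means are just one simple pairing, and the numbers 10 (components) and 3 (clusters) are placeholders you would need to tune for real resume embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import normalize

# Placeholder data: 3 synthetic groups of 50 "embeddings", 128 dimensions each.
# Replace this with your own (n_samples, n_features) array.
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 128)) * 5.0
embeddings = np.vstack([c + rng.normal(size=(50, 128)) for c in centers])

# L2-normalize rows so that Euclidean distance between them tracks cosine similarity.
unit = normalize(embeddings)

# Project to a lower-dimensional space (10 is an arbitrary choice here).
reduced = PCA(n_components=10, random_state=0).fit_transform(unit)

# Cluster in the reduced space (3 clusters is an assumption that matches the toy data).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(reduced)

# Silhouette score ranges from -1 to 1; higher means tighter, better-separated clusters.
score = silhouette_score(reduced, labels)
print(labels.shape, score)
```

The silhouette score gives you one possible way to put a number on "good enough": you can rerun with different values of `n_clusters` and compare. It will look very good on this toy data because the groups are artificially well separated, so do not expect similar values on real embeddings.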
