Which model should I use to create a classifier without labels?

I have a large dataset but insufficient labeled data. Each piece of data has been embedded in a 57x574 matrix. I want to classify them into different groups using deep learning, but I am unsure which model or technique would be best for this task. I am also uncertain whether deep learning is the most suitable approach at all.

An algorithm like K-Means will group your examples into clusters, based on the numerical similarity of their features. You have to iterate using different numbers of clusters until you get groupings that are logical for your specific task or dataset.
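
As a rough sketch of that iteration, assuming scikit-learn is available: the data below is synthetic stand-in data, and the array name `X`, the candidate values of `k`, and the random seed are illustrative choices, not anything from your dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: 300 examples with 20 features, drawn around 3 centers.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(3, 20))
X = np.vstack([c + rng.normal(size=(100, 20)) for c in centers])

# Try several cluster counts and keep the assignments for inspection.
labels_by_k = {}
for k in (2, 3, 4, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    labels_by_k[k] = model.labels_
    # inertia_ is the within-cluster sum of squared distances (lower is tighter).
    print(k, round(model.inertia_, 1), np.bincount(model.labels_))
```

Inertia generally keeps dropping as k grows, so it cannot pick k by itself; you still have to look at the groupings and judge whether they make sense for your task.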

It’s not a deep learning method, since deep learning almost always requires labeled data.


Thank you very much! However, considering that each example is a 57x574 matrix, my understanding is that deep learning is better suited to learning richer characteristics within the data. Is K-means up to this task? If I don’t have enough labeled data to train on, can I use unsupervised or semi-supervised learning?

To be clear:
Does your data set have 57 examples with 574 features? Or does it have 574 examples with 57 features?

You can’t make much use of deep learning unless your data is labeled.

K-means is unsupervised learning.

Thank you for your patient explanation. I have a dataset consisting of 190,000 examples, with each example represented as a 57x574 matrix that can be flattened to a 1x32,718 vector.
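
For reference, the flattening step can be sketched with NumPy. This assumes the matrices are stacked in a single array and stored as float32; the tiny 4-example `batch` is a hypothetical stand-in for the real 190,000 examples.

```python
import numpy as np

# Small stand-in batch: 4 examples, each a 57x574 matrix.
batch = np.zeros((4, 57, 574), dtype=np.float32)

# Flatten each matrix row-major into one feature vector per example.
flat = batch.reshape(len(batch), -1)
print(flat.shape)  # (4, 32718)

# Size of the full flattened dataset at 4 bytes per value (float32 assumed).
n_bytes = 190_000 * 57 * 574 * 4
print(n_bytes / 1e9, "GB")
```

Under that float32 assumption the full flattened set is roughly 25 GB, so it may not fit in memory all at once.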

I have a different question: If I have 190,000 examples, each comprising 574 vectors, how can I classify them without knowing the exact number of groups? My goal is to maximize the distance between groups while minimizing the distance within each group. Thank you for your assistance.
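
One common way to put a number on "far apart between groups, tight within groups" is the silhouette score. The following is only a sketch on synthetic data, assuming scikit-learn; the array `X`, the range of `k`, and the seed are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: 400 examples with 20 features, drawn around 4 centers.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(4, 20))
X = np.vstack([c + rng.normal(size=(100, 20)) for c in centers])

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette approaches 1 when points sit close to their own cluster
    # and far from the nearest other cluster.
    scores[k] = silhouette_score(X, labels)

# Pick the cluster count with the highest average silhouette.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real data the peak is usually less sharp than on a toy set like this, so treat the best-scoring k as a candidate to inspect rather than a final answer.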

Do you have experience using K-Means?

No, I will learn and try it! Thank you so much!

You might want to review the meaning of the terms classify versus cluster in this context. This public document from Google developer training has some explanations I find useful …

If you decide on clustering as the general approach, you can also choose among several alternative similarity or distance measures. Some are introduced in the document linked above. My intuition is that the 2-D structure of your data might suggest a preferred metric, but I haven’t thought that through completely.
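
To illustrate how the choice of measure changes what counts as "similar", here is a small NumPy sketch comparing Euclidean and cosine distance on two 57x574 matrices. The matrices are random stand-ins, and these two measures are just examples, not a recommendation for your data.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(57, 574))
b = 2.0 * a  # same direction as a, twice the magnitude

# Euclidean distance on the flattened vectors equals the Frobenius norm of a - b.
euclidean = np.linalg.norm(a.ravel() - b.ravel())

# Cosine distance ignores magnitude, so a scaled copy is at distance ~0.
cosine = 1.0 - (a.ravel() @ b.ravel()) / (np.linalg.norm(a) * np.linalg.norm(b))
print(euclidean, cosine)
```

Euclidean distance treats the scaled copy as far away while cosine distance treats it as identical, which is the kind of difference worth deciding on before you cluster.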

Let us know what you come up with.


Indeed. Classification requires a labeled training set.

If you don’t have labels, all you can do is create clusters.

The members of a cluster may (or may not) be in the same class. You really don’t know unless you have labeled data and can use supervised learning.
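
If you do have labels for even a few examples, a quick check is to see how mixed each cluster is. This sketch uses made-up cluster IDs and labels purely for illustration; `cluster_ids` and `known_labels` are hypothetical names.

```python
from collections import Counter

# Hypothetical values: cluster IDs from a clustering run, and known class
# labels for a handful of examples (None where no label exists).
cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
known_labels = ["a", "a", None, "b", "a", None, "c", "c", None, "c"]

# Tally the known labels that fall inside each cluster.
per_cluster = {}
for cid, lab in zip(cluster_ids, known_labels):
    if lab is not None:
        per_cluster.setdefault(cid, Counter())[lab] += 1

# Purity: share of labeled members that belong to the cluster's majority class.
purity = {cid: max(c.values()) / sum(c.values()) for cid, c in per_cluster.items()}
print(purity)
```

A purity well below 1 for a cluster means its members come from several classes, so that cluster should not be read as a class.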