Recall, precision, and overall accuracy work well when you have supervised learning models and labeled data, but what about models with no labeled data, e.g. unsupervised clustering or topic analysis on text? There, the standard performance metrics do not always correlate directly with the outcome labels you care about, but rather with the statistical structure of the model.
In clustering, for example, the silhouette score gives a sense of how well records are grouped within each cluster, but that may or may not correspond to an actual business grouping.
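As a concrete illustration of what I mean, here is a minimal sketch (assuming scikit-learn, with random stand-in data) of the kind of metric I can compute without labels:

```python
# Minimal sketch: computing the silhouette score for a k-means clustering.
# Assumes scikit-learn; X stands in for a real, unlabeled feature matrix.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))  # placeholder for real business data

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Ranges from -1 (poor separation) to 1 (dense, well-separated clusters);
# a high score still does not guarantee the clusters match business groups.
print(silhouette_score(X, labels))
```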
So my question is basically: what other lines of thought or recommendations apply in these scenarios, when labeled data is just not available?
You might benefit from reading *Evaluation Metrics for Unsupervised Learning Algorithms*.
It at least contains ideas you could search on or consider further. The null hypothesis testing idea intrigues me: if the data is truly random, your unsupervised learning should not be telling you it found structure.
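A hedged sketch of that idea, assuming scikit-learn: cluster the real data, then cluster reference datasets drawn uniformly over the same bounding box (in the spirit of the gap statistic), and check whether the real score clearly beats the null scores.

```python
# Null-hypothesis-style check: if the real data's silhouette score is not
# clearly better than scores from structureless random data of the same
# shape and range, the "structure" found may be an artifact.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def kmeans_silhouette(X, k=4, seed=0):
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    return silhouette_score(X, labels)

rng = np.random.default_rng(0)
# Toy "real" data with three genuine clusters
X = np.vstack([rng.normal(loc=c, size=(100, 5)) for c in (0, 4, 8)])

real_score = kmeans_silhouette(X)
null_scores = [
    kmeans_silhouette(rng.uniform(X.min(0), X.max(0), size=X.shape), seed=s)
    for s in range(20)
]
print(f"real: {real_score:.3f}  null 95th pct: {np.quantile(null_scores, 0.95):.3f}")
```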
My own two cents: it is difficult to assess the results of a single output (say, k-means with one fixed number of clusters), but you can compute and rank aggregate distance metrics across multiple outputs (e.g. k-means with 3 clusters vs. 4 clusters).
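Something like this sketch (again assuming scikit-learn; silhouette is used here, but inertia or Davies-Bouldin would work the same way):

```python
# Compare multiple outputs rather than judging one in isolation:
# fit k-means for a range of k and rank the candidates by silhouette score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(100, 5)) for c in (0, 4, 8)])

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Rank candidate cluster counts, best first
for k, s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"k={k}: silhouette={s:.3f}")
```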
In the YOLO algorithm for object detection, k-means clustering is used to choose the anchor box priors, and the number of clusters is a hyperparameter of the convolutional neural network architecture, so ultimately you can compare the impact the unsupervised learning output has on the accuracy of the downstream (supervised learning) processing. HTH
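Not YOLO itself, but a hedged illustration of the same principle (assuming scikit-learn): treat the number of clusters as a hyperparameter and score it by the cross-validated accuracy of a downstream supervised model that consumes the clustering output.

```python
# Evaluate an unsupervised step by its downstream supervised accuracy:
# KMeans acts as a transformer (its transform() yields distances to each
# centroid), feeding features to a logistic regression classifier.
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

for k in (2, 4, 8):
    pipe = make_pipeline(
        KMeans(n_clusters=k, n_init=10, random_state=0),
        LogisticRegression(max_iter=1000),
    )
    acc = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"k={k}: downstream CV accuracy={acc:.3f}")
```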