We have learnt some useful ways to calculate metrics for evaluating LLMs on text summarisation and translation. I have a question: when is the model good enough? Is there a value, e.g. 0.5 BLEU or a given HELM score, above which the model is good enough to become a POC I can trust and present to the client? Or is it entirely subjective?
@fridaki keep in mind there is also perplexity, a measure of how well a model predicts human-generated text, where lower is better.
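For illustration, here is a minimal sketch of computing perplexity with a causal language model via Hugging Face transformers; the gpt2 checkpoint and the sample sentence are assumptions, not part of the original discussion.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM checkpoint works here; gpt2 is just a small, convenient example.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean
    # cross-entropy (negative log-likelihood per token).
    outputs = model(**inputs, labels=inputs["input_ids"])

# Perplexity = exp(mean negative log-likelihood); lower is better.
perplexity = torch.exp(outputs.loss)
print(f"Perplexity: {perplexity.item():.2f}")
```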
“Good enough” depends on the required performance for a specific application.
I recommend a structured approach. Determining whether a model meets the required quality standards has no universal answer in the form of a single metric threshold; instead, work through the following steps:
First, address these key questions:
- What is the minimum quality level required to solve the business problem?
- Which errors are critical, and which are acceptable?
- What are the expectations of end users?
Second, it is important to:
- Evaluate how much better your model performs than baseline solutions.
- Combine automated metrics (e.g., BLEU, ROUGE) with human evaluation; a short sketch of the automated part follows this list.
- Conduct pilot testing with real users and gather their feedback.
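As a starting point for the automated side, the sketch below scores a single candidate against a reference with BLEU and ROUGE. It assumes the sacrebleu and rouge_score packages are installed; the reference and candidate strings are made up for illustration.

```python
import sacrebleu
from rouge_score import rouge_scorer

reference = "The cat sat on the mat."
candidate = "A cat was sitting on the mat."

# Corpus-level BLEU (sacrebleu reports scores on a 0-100 scale).
bleu = sacrebleu.corpus_bleu([candidate], [[reference]])
print(f"BLEU: {bleu.score:.1f}")

# ROUGE-1 and ROUGE-L F1 scores (0-1 scale).
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
for name, score in scores.items():
    print(f"{name}: F1 = {score.fmeasure:.2f}")
```

Automated scores like these complement human judgement; they should not replace it.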
A specific metric value (e.g., BLEU = 0.5) may be excellent for one project but insufficient for another. Therefore, defining success criteria through discussions with stakeholders and end users is crucial to ensuring that the model meets their actual needs.
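To make those success criteria concrete, one option is to encode them as an explicit acceptance gate. The sketch below is hypothetical: the metric floors and the 1-5 human-rating scale are placeholders that would come out of the stakeholder discussions, not recommended values.

```python
from dataclasses import dataclass

@dataclass
class AcceptanceCriteria:
    min_bleu: float          # floor for the automated metric (sacrebleu 0-100 scale)
    min_human_rating: float  # floor for mean pilot-user rating on a 1-5 scale

def meets_criteria(bleu: float, human_rating: float,
                   criteria: AcceptanceCriteria) -> bool:
    """Both the automated and the human signal must clear their floors."""
    return bleu >= criteria.min_bleu and human_rating >= criteria.min_human_rating

# One project might require BLEU >= 40 and a mean user rating of 4.0,
# while another is fine with much less; the numbers are project-specific.
criteria = AcceptanceCriteria(min_bleu=40.0, min_human_rating=4.0)
print(meets_criteria(bleu=43.2, human_rating=4.3, criteria=criteria))  # True
```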