I would like to know what the best metrics are for evaluating Question Answering LLMs, and whether there are any available examples.
Thanks in advance.
Hi Laura,
Here are some metrics for evaluating QA systems:
F1 Score: measures the token-level overlap between the predicted answer and the ground truth answer, balancing precision and recall.
Exact Match (EM): checks if the predicted answer is exactly the same as the ground truth answer.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): measures the overlap of n-grams between the generated text and a reference text.
BLEU (Bilingual Evaluation Understudy): can be used to some extent in QA to measure how many words and phrases in the model’s output overlap with a reference output.
METEOR: an advanced metric that improves on BLEU by also considering synonyms and stemming.
Hugging Face Transformers also provides built-in evaluation scripts for QA tasks; you can use these scripts as examples for evaluating your models (see the sketch after this list).
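As a concrete starting point, here is a minimal sketch using the Hugging Face `evaluate` library (a separate package, installed with `pip install evaluate`). The metric identifiers `"squad"`, `"rouge"`, and `"meteor"` are the names that library uses; the question IDs and answer strings below are made up purely for illustration:

```python
import evaluate

# Exact Match and F1 as computed for SQuAD-style extractive QA.
squad_metric = evaluate.load("squad")

predictions = [
    {"id": "q1", "prediction_text": "Denver Broncos"},
]
references = [
    {"id": "q1", "answers": {"text": ["Denver Broncos"], "answer_start": [177]}},
]

print(squad_metric.compute(predictions=predictions, references=references))
# -> {'exact_match': 100.0, 'f1': 100.0}

# N-gram overlap metrics, more useful for free-form generative answers.
# Note: "rouge" needs the rouge_score package and "meteor" needs nltk.
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

generated = ["The Broncos won Super Bowl 50 in 2016."]
gold = ["Super Bowl 50 was won by the Denver Broncos in 2016."]

print(rouge.compute(predictions=generated, references=gold))
print(meteor.compute(predictions=generated, references=gold))
```

Roughly speaking, EM and F1 are the usual choice for short extractive answers, while ROUGE, BLEU, and METEOR are more informative when the model generates longer free-form answers.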
Depending on your specific use case and domain, you might also want to develop custom metrics that accurately reflect the performance of your QA system.
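If you do roll your own metric, a common starting point is a normalized token-level F1, similar in spirit to the SQuAD evaluation script. The sketch below is just an illustration of that idea (the normalization rules are assumptions you would adapt to your domain), not an official implementation:

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("in the year 1976", "1976"))  # 0.5 (partial overlap after normalization)
```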
Hope this helps!