I would like to know what the best metrics are for evaluating Question Answering LLMs, and whether there are any available examples.
Thanks in advance.
Hi Laura,
Here are some metrics for evaluating QA systems:
F1 Score: measures the token-level overlap between the predicted answer and the ground truth answer, balancing precision and recall.
Exact Match (EM): checks if the predicted answer is exactly the same as the ground truth answer.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): measures the overlap of n-grams between the generated text and a reference text.
BLEU (Bilingual Evaluation Understudy): can be used to some extent in QA to measure how many words and phrases in the model’s output overlap with a reference output.
METEOR: an advanced metric that improves on BLEU by also considering synonyms and stemming.
Hugging Face Transformers also provides built-in evaluation scripts for QA tasks; you can use these scripts as examples for evaluating your models (see the sketch after this list).
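As a concrete starting point, here is a minimal sketch using the Hugging Face `evaluate` library (a separate package, installed with `pip install evaluate`). The metric identifiers `"squad"`, `"rouge"`, and `"meteor"` are the names that library uses; the question IDs and answer strings below are made up purely for illustration:

```python
import evaluate

# Exact Match and F1 as computed for SQuAD-style extractive QA.
squad_metric = evaluate.load("squad")

predictions = [
    {"id": "q1", "prediction_text": "Denver Broncos"},
]
references = [
    {"id": "q1", "answers": {"text": ["Denver Broncos"], "answer_start": [177]}},
]

print(squad_metric.compute(predictions=predictions, references=references))
# -> {'exact_match': 100.0, 'f1': 100.0}

# N-gram overlap metrics, more useful for free-form generative answers.
# Note: "rouge" needs the rouge_score package and "meteor" needs nltk.
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

generated = ["The Broncos won Super Bowl 50 in 2016."]
gold = ["Super Bowl 50 was won by the Denver Broncos in 2016."]

print(rouge.compute(predictions=generated, references=gold))
print(meteor.compute(predictions=generated, references=gold))
```

Roughly speaking, EM and F1 are the usual choice for short extractive answers, while ROUGE, BLEU, and METEOR are more informative when the model generates longer free-form answers.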
Depending on your specific use case and domain, you might also want to develop custom metrics that accurately reflect the performance of your QA system.
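If you do roll your own metric, a common starting point is a normalized token-level F1, similar in spirit to the SQuAD evaluation script. The sketch below is just an illustration of that idea (the normalization rules are assumptions you would adapt to your domain), not an official implementation:

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("in the year 1976", "1976"))  # 0.5 (partial overlap after normalization)
```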
Hope this helps!