RAG evaluation metrics: score thresholds and when to use each metric

Hi,

I use some evaluation metrics for RAG, such as context precision, context recall, context relevance, answer relevance, answer correctness, answer conciseness, answer similarity, ROUGE-N, and BLEU.

Any proposals for good criteria for defining a score threshold for each evaluation metric?
Also, when is each metric suitable? (Should I classify my questions and choose suitable metrics for each classification?)

Thanks

Your query seems vague. What kind of RAG application are you creating, and how are you using the evaluation metrics you mentioned?

What were the results, and what different results are you looking for?

Classify questions? Is it a question-answering model? Have you also incorporated a transformer model into it?

Please always give a brief explanation of the reasoning behind your query and what you are seeking.

You have actually covered a good range of evaluation metrics for text-vectorization models. If your model is more complex, you could also try METEOR (Metric for Evaluation of Translation with Explicit ORdering), though it is more helpful for translation models.
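If you do try it, here is a rough sketch using NLTK (the sentences are only illustrative, and you need the WordNet data downloaded first):

```python
# Rough METEOR sketch with NLTK; assumes `pip install nltk`.
# The reference/candidate sentences below are illustrative placeholders.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # needed once for METEOR's synonym matching

reference = "the cat sat on the mat".split()      # tokenized reference
candidate = "a cat was sitting on the mat".split()  # tokenized hypothesis

# meteor_score takes a list of tokenized references and one tokenized hypothesis
score = meteor_score([reference], candidate)
print(f"METEOR: {score:.3f}")
```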

Ok, I’ll give more details.
I’m testing a chatbot (not a general-purpose chatbot, but one for a specific domain), and I’m using the following evaluation metrics to validate the chatbot's answers: context precision, context recall, context relevance, answer relevance, answer correctness, answer conciseness, answer similarity, ROUGE-N, and BLEU.

As a first step, I classified the possible questions (Fact-Based, Open-Ended, Yes/No, Instructional, etc.).

Then I tried to select suitable metrics for each type of question.

Finally, I want to set a score threshold for each metric so that we can consider an answer acceptable.
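To make the second step concrete, something like this mapping is what I mean (the question types and metric groupings are only illustrative, not final choices):

```python
# Illustrative mapping of question type -> metrics I plan to apply.
# The groupings below are placeholders, not tuned choices.
METRICS_BY_QUESTION_TYPE = {
    "fact_based":    ["answer_correctness", "context_precision", "context_recall"],
    "open_ended":    ["answer_relevance", "context_relevance"],
    "yes_no":        ["answer_correctness"],
    "instructional": ["answer_relevance", "answer_conciseness"],
}

def metrics_for(question_type: str) -> list[str]:
    """Return the metric names to run for a given question type."""
    return METRICS_BY_QUESTION_TYPE.get(question_type, ["answer_relevance"])
```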

Is that clearer now?

Thanks a lot

And what were the results after your metric evaluation, and why are you looking for more evaluation metrics?

If you want to use a threshold score, did you try log perplexity? That could be a good way to score sequence prediction for a question-and-answer model, as well as when translation is involved.
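As a rough sketch (the model name and text are only illustrative), log perplexity can be read straight off a causal LM's loss:

```python
# Rough sketch: log perplexity of a candidate answer under a causal LM.
# "gpt2" and the answer text are placeholders; any HF causal LM works similarly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

answer = "The warranty covers manufacturing defects for two years."
inputs = tokenizer(answer, return_tensors="pt")

with torch.no_grad():
    # With labels == input_ids, the returned loss is the mean negative
    # log-likelihood per token, i.e. the log perplexity.
    outputs = model(**inputs, labels=inputs["input_ids"])

log_ppl = outputs.loss.item()
print(f"log perplexity: {log_ppl:.3f}  (perplexity: {torch.exp(outputs.loss).item():.1f})")
```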


@adiaa84
You might find Ragas (an open-source RAG evaluation tool) quite helpful.
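If it helps, here is a minimal usage sketch (the column names follow the older 0.1.x-style Ragas API and the sample data is made up, so check the Ragas docs for the current interface):

```python
# Minimal Ragas sketch; column names follow the 0.1.x-style API and the sample
# data is invented. evaluate() uses an LLM judge, so an API key is typically needed.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall, answer_relevancy

data = Dataset.from_dict({
    "question":     ["What does the warranty cover?"],
    "answer":       ["It covers manufacturing defects for two years."],
    "contexts":     [["The warranty covers manufacturing defects for two years."]],
    "ground_truth": ["Manufacturing defects are covered for two years."],
})

result = evaluate(data, metrics=[context_precision, context_recall, answer_relevancy])
print(result)  # per-metric scores between 0 and 1
```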

@Isaak_Kamau Yes, there’s Ragas, deepeval, and other evaluation tools, but all of these tools just generate a score, and you’re the one who needs to define an acceptable score threshold for each evaluation metric.
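For example, what I have in mind is something like this (the threshold values are placeholders I still need to choose):

```python
# Sketch of the thresholding step I have in mind; threshold values are placeholders.
thresholds = {
    "context_precision": 0.60,
    "context_recall":    0.70,
    "answer_relevancy":  0.75,
}

# Example scores as returned by an evaluation tool (made-up numbers).
scores = {"context_precision": 0.31, "context_recall": 0.85, "answer_relevancy": 0.90}

failed = {m: s for m, s in scores.items() if s < thresholds[m]}
if failed:
    print("Answer rejected, below threshold on:", failed)
else:
    print("Answer accepted")
```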

@Deepti_Prasad I’ll give more details to clarify my problem.
When I use an evaluation metric like context precision, I sometimes get a low score because most of the retrieved chunks are irrelevant (only 5 out of 16 were relevant), even though the answer was very good: the 5 relevant chunks were enough, and the other 11 irrelevant chunks didn't affect the answer negatively.
That made me think I'm misusing these evaluation metrics.
That’s why I’m looking for more details on how and when to use each evaluation metric.
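As a simplified illustration of what I mean (this is a naive precision ratio, not the exact context-precision formula used by tools like Ragas, which also accounts for chunk ranking):

```python
# Naive illustration only: plain precision over retrieved chunks.
relevant_retrieved = 5
total_retrieved = 16

naive_precision = relevant_retrieved / total_retrieved
print(f"{naive_precision:.2f}")  # 0.31 -> looks "bad" even though the answer was fine
```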

Hope that’s clear.

I would like to see your code before I can say whether your metric evaluation was miscalculated. Honestly, I don't want to speculate about what data you are working with. You can share a link to your code file by personal DM.