RAG evaluation metrics: score thresholds and when to use each metric

Hi,

I use some evaluation metrics for RAG, such as context precision, context recall, context relevance, answer relevance, answer correctness, answer conciseness, answer similarity, ROUGE-N, and BLEU.

Any proposals for good criteria for defining a score threshold for each evaluation metric?
Also, when is each metric suitable? (Should I classify my questions and choose suitable metrics for each classification?)

Thanks

Your query seems vague. What kind of RAG application are you creating, and how are you using the evaluation metrics you mentioned?

What were the results, and what different results are you looking for?

Classify questions? Is it a question-answering model? Have you also incorporated a transformer model into it?

Please always give a brief explanation of the reasoning behind your query and what you are seeking.

You have actually covered a good range of evaluation metrics for text-vectorization models. If your model is more complex, you could also try METEOR (Metric for Evaluation of Translation with Explicit ORdering), though it is more helpful for translation models.
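If you do try it, here is a rough sketch using NLTK (the sentences are only illustrative, and you need the WordNet data downloaded first):

```python
# Rough METEOR sketch with NLTK; assumes `pip install nltk`.
# The reference/candidate sentences below are illustrative placeholders.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # needed once for METEOR's synonym matching

reference = "the cat sat on the mat".split()      # tokenized reference
candidate = "a cat was sitting on the mat".split()  # tokenized hypothesis

# meteor_score takes a list of tokenized references and one tokenized hypothesis
score = meteor_score([reference], candidate)
print(f"METEOR: {score:.3f}")
```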

Ok, I’ll give more details.
I’m testing a chatbot (not a general-purpose chatbot, but one for a specific domain), and I’m using the following evaluation metrics to validate the chatbot's answers: context precision, context recall, context relevance, answer relevance, answer correctness, answer conciseness, answer similarity, ROUGE-N, and BLEU.

As a first step, I classified the possible questions (Fact-Based, Open-Ended, Yes/No, Instructional, etc.).

Then I tried to select suitable metrics for each type of question.

Finally, I want to set a score threshold for each metric so that we can consider an answer acceptable.
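To make the second step concrete, something like this mapping is what I mean (the question types and metric groupings are only illustrative, not final choices):

```python
# Illustrative mapping of question type -> metrics I plan to apply.
# The groupings below are placeholders, not tuned choices.
METRICS_BY_QUESTION_TYPE = {
    "fact_based":    ["answer_correctness", "context_precision", "context_recall"],
    "open_ended":    ["answer_relevance", "context_relevance"],
    "yes_no":        ["answer_correctness"],
    "instructional": ["answer_relevance", "answer_conciseness"],
}

def metrics_for(question_type: str) -> list[str]:
    """Return the metric names to run for a given question type."""
    return METRICS_BY_QUESTION_TYPE.get(question_type, ["answer_relevance"])
```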

Is that clearer now?

Thanks a lot

And what were the results after your metric evaluation, and why are you looking for more evaluation metrics?

If you want to use a threshold score, did you try log perplexity? That could be a good way to score sequence prediction for a question-and-answer model, as well as when translation is involved.
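As a rough sketch (the model name and text are only illustrative), log perplexity can be read straight off a causal LM's loss:

```python
# Rough sketch: log perplexity of a candidate answer under a causal LM.
# "gpt2" and the answer text are placeholders; any HF causal LM works similarly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

answer = "The warranty covers manufacturing defects for two years."
inputs = tokenizer(answer, return_tensors="pt")

with torch.no_grad():
    # With labels == input_ids, the returned loss is the mean negative
    # log-likelihood per token, i.e. the log perplexity.
    outputs = model(**inputs, labels=inputs["input_ids"])

log_ppl = outputs.loss.item()
print(f"log perplexity: {log_ppl:.3f}  (perplexity: {torch.exp(outputs.loss).item():.1f})")
```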


@adiaa84
You might find Ragas (an open-source RAG evaluation tool) quite helpful.
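If it helps, here is a minimal usage sketch (the column names follow the older 0.1.x-style Ragas API and the sample data is made up, so check the Ragas docs for the current interface):

```python
# Minimal Ragas sketch; column names follow the 0.1.x-style API and the sample
# data is invented. evaluate() uses an LLM judge, so an API key is typically needed.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall, answer_relevancy

data = Dataset.from_dict({
    "question":     ["What does the warranty cover?"],
    "answer":       ["It covers manufacturing defects for two years."],
    "contexts":     [["The warranty covers manufacturing defects for two years."]],
    "ground_truth": ["Manufacturing defects are covered for two years."],
})

result = evaluate(data, metrics=[context_precision, context_recall, answer_relevancy])
print(result)  # per-metric scores between 0 and 1
```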

@Isaak_Kamau Yes, there’s Ragas, deepeval, and other evaluation tools, but all of these tools just generate a score, and you’re the one who needs to define an acceptable score threshold for each evaluation metric.
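For example, what I have in mind is something like this (the threshold values are placeholders I still need to choose):

```python
# Sketch of the thresholding step I have in mind; threshold values are placeholders.
thresholds = {
    "context_precision": 0.60,
    "context_recall":    0.70,
    "answer_relevancy":  0.75,
}

# Example scores as returned by an evaluation tool (made-up numbers).
scores = {"context_precision": 0.31, "context_recall": 0.85, "answer_relevancy": 0.90}

failed = {m: s for m, s in scores.items() if s < thresholds[m]}
if failed:
    print("Answer rejected, below threshold on:", failed)
else:
    print("Answer accepted")
```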

@Deepti_Prasad I’ll give more details to clarify my problem.
When I use an evaluation metric like context precision, I sometimes get a low score because most of the retrieved chunks are irrelevant (only 5 out of 16 were relevant), even though the answer was very good: the 5 relevant chunks were enough, and the other 11 irrelevant chunks didn't affect the answer negatively.
That made me think I'm misusing these evaluation metrics.
That’s why I’m looking for more details on how and when to use each evaluation metric.
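As a simplified illustration of what I mean (this is a naive precision ratio, not the exact context-precision formula used by tools like Ragas, which also accounts for chunk ranking):

```python
# Naive illustration only: plain precision over retrieved chunks.
relevant_retrieved = 5
total_retrieved = 16

naive_precision = relevant_retrieved / total_retrieved
print(f"{naive_precision:.2f}")  # 0.31 -> looks "bad" even though the answer was fine
```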

Hope that’s clear.

I would like to see your code before I can say whether your metric evaluation was miscalculated. Honestly, I don't want to speculate about what data you are working with. You can share a link to your code file by personal DM.