I understand that the Groundedness, Answer Relevance, and Context Relevance metrics are all returned by the custom functions inside the TruLens class. But can you give any more insight into how these metrics are actually calculated? Are they all handed off to the LLM to return a quantitative score, or is there a ROUGE/BLEU-like calculation going on under the covers (or cosine similarity, or Levenshtein, or something else)?
If I got it right, they ask the LLM to score each result between 0 and 10. The values are then normalized (divided by 10), and you take np.mean of the many comparisons to get a single score.
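For what it's worth, here is a minimal sketch of that aggregation as I understand it (the function names are mine, not part of the TruLens API): the judge LLM assigns each statement a 0-10 rating, the ratings are divided by 10, and np.mean collapses them into a single metric value.

```python
import numpy as np

def normalize_rating(raw_rating: float, max_rating: float = 10.0) -> float:
    """Map an LLM-assigned 0-10 rating onto the 0-1 range."""
    return raw_rating / max_rating

def aggregate_ratings(raw_ratings: list[float]) -> float:
    """Average the normalized per-statement ratings into one metric score."""
    return float(np.mean([normalize_rating(r) for r in raw_ratings]))

# Hypothetical groundedness ratings the judge LLM gave to three answer statements
ratings = [9, 7, 10]
print(aggregate_ratings(ratings))  # 0.8666...
```

The exact prompts and parsing differ per metric, so treat this only as an illustration of the normalize-then-average step.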
Thanks for the explanation. That's what I thought. This, of course, raises the question of whether to trust a "black box" to return a score that is credible, explainable, and/or interpretable.
Exactly, I have the same doubt. We are evaluating an LLM's response by using another LLM. How can we quantify that? Every iteration gives slightly different measures for the same query.
I do understand it provides some meaning, but I feel the need for more information about this intuition.
Maybe the best way to think about it is in the sense of "All models are wrong, but some are useful." Even if his implementation is not ideal, it can still be valuable. And if you want more trust, you can deep-dive into each of those scores and check the sentences yourself. I would assume that at the high and low ends the model is not wrong most of the time; in between, I would not trust the scores and would maybe check myself.