I understand that the Groundedness, Answer Relevance, and Context Relevance metrics are all returned by the custom functions inside the TruLens class. But can you give any more insight into how these metrics are actually calculated? Are they all handed off to the LLM to return a quantitative score, or is there a ROUGE/BLEU-like calculation going on under the covers (or cosine similarity, or Levenshtein, or something else)?
If I got it right, they ask the LLM to score each result between 0 and 10. The values are then normalized (divided by 10), and you take np.mean of the many comparisons to get a single score.
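For what it's worth, here is a minimal sketch of that aggregation as I understand it (the function names are mine, not part of the TruLens API): the judge LLM assigns each statement a 0-10 rating, the ratings are divided by 10, and np.mean collapses them into a single metric value.

```python
import numpy as np

def normalize_rating(raw_rating: float, max_rating: float = 10.0) -> float:
    """Map an LLM-assigned 0-10 rating onto the 0-1 range."""
    return raw_rating / max_rating

def aggregate_ratings(raw_ratings: list[float]) -> float:
    """Average the normalized per-statement ratings into one metric score."""
    return float(np.mean([normalize_rating(r) for r in raw_ratings]))

# Hypothetical groundedness ratings the judge LLM gave to three answer statements
ratings = [9, 7, 10]
print(aggregate_ratings(ratings))  # 0.8666...
```

The exact prompts and parsing differ per metric, so treat this only as an illustration of the normalize-then-average step.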
Thanks for the explanation. That's what I thought. This, of course, raises the question of whether to trust a "black box" to return a score that is credible, explainable, and/or interpretable.
Exactly, I have the same doubt. We are evaluating an LLM's response by using another LLM. How can we quantify that? Every iteration gives slightly different measures for the same query.
I do understand it provides some meaning, but I feel the need for more information about this intuition.
Maybe the best way to think about it is in the sense of "All models are wrong, but some are useful." Even if his implementation is not ideal, it can still be valuable. And if you want more trust, you can deep-dive into each of those scores and check the sentences yourself. I would assume that at the high and low ends the model is not wrong most of the time; in between, I would not trust the scores and would maybe check myself.