How should one interpret a TruLens eval that shows high answer relevance and high groundedness, but low context relevance?
My guess is that only a small part of the retrieved context is actually relevant. The LLM can still use that relevant part to produce its answer, which leads to high groundedness and answer relevance. If, for example, only 1 out of 5 context chunks is relevant, the context relevance comes out low, because (as far as I know) the score is computed per chunk and then aggregated, typically averaged, over all retrieved chunks. That is a useful insight, because it suggests you could make the retrieval more efficient.
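To make the 1-out-of-5 case concrete, here is a rough sketch of the arithmetic. It assumes the per-chunk scores are on a 0 to 1 scale and are aggregated with a plain mean; the score values themselves are made up for illustration and do not come from a real run.

```python
from statistics import mean

# Hypothetical per-chunk context relevance scores (0 to 1 scale, assumed).
# Only the first of the five retrieved chunks is actually relevant.
chunk_scores = [0.9, 0.1, 0.0, 0.1, 0.0]

# Assumed aggregation: a plain mean over all retrieved chunks.
context_relevance = mean(chunk_scores)

# The single best chunk is still strong enough to ground a good answer.
best_chunk_score = max(chunk_scores)

print(f"aggregated context relevance: {context_relevance:.2f}")
print(f"best single chunk: {best_chunk_score:.2f}")
```

One good chunk is enough for a grounded, relevant answer, while the four weak chunks pull the averaged score down.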
@David_Hillmann - thanks for your response. Indeed, I was wondering how to understand the context relevance and groundedness scores.
If the answer is relevant and well grounded in the context, as the high answer relevance and groundedness scores suggest, then either the whole context is relevant (in which case context relevance should also be high) or only a small fraction of the retrieved chunks is relevant to the query (which would explain the low context relevance score). Given that, should we focus on improving context retrieval or not? The answer is already highly relevant to the query, so would the main gain from a higher context relevance score be getting the same answer with fewer tokens? Is there a way to remove the chunks with low context relevance scores before the synthesis step? Is this filtering already done by the LlamaIndex libraries? Just wondering out loud here …
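To illustrate what I mean by filtering before synthesis, here is a minimal sketch in plain Python. `ScoredChunk`, `filter_chunks`, and the 0.5 threshold are all hypothetical names and values I made up for illustration; this is not the LlamaIndex or TruLens API, just the idea of dropping low-scoring chunks before they reach the LLM.

```python
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    """Hypothetical container: a retrieved chunk plus its relevance score."""
    text: str
    relevance: float  # assumed to be on a 0 to 1 scale


def filter_chunks(chunks, threshold=0.5):
    """Keep only chunks whose relevance score meets the threshold."""
    return [c for c in chunks if c.relevance >= threshold]


# Made-up retrieval result: one relevant chunk, four weak ones.
retrieved = [
    ScoredChunk("chunk A", 0.9),
    ScoredChunk("chunk B", 0.1),
    ScoredChunk("chunk C", 0.0),
    ScoredChunk("chunk D", 0.1),
    ScoredChunk("chunk E", 0.0),
]

kept = filter_chunks(retrieved)

# Only the kept chunks would be passed on to the synthesis step.
synthesis_context = "\n\n".join(c.text for c in kept)
print(f"kept {len(kept)} of {len(retrieved)} chunks")
```

The catch is that scoring every chunk before synthesis costs extra LLM calls if the score comes from an LLM-based feedback function, so the token savings in the synthesis prompt would have to outweigh that.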