Human-human vs. human-LLM agreement in evals: reference request

In the ‘RAG Triad of Metrics’ video, Anupam Datta refers (at 25:45) to a finding in the research literature. He says that when a set of humans is asked to evaluate the output of a RAG application, the agreement among the humans is around 80%. He goes on to say that, interestingly, human-LLM agreement is also in the 80-85% range. I am really curious to find the literature behind this and learn more about it. Could someone point me to a specific paper (or papers) that supports this result?

Thank you,