Human-human vs. human-LLM agreement in evals: reference request

In the ‘RAG Triad of Metrics’ video, Anupam Datta refers (at 25:45) to a finding in the research literature. He says that when a set of humans is asked to evaluate the output of a RAG application, the agreement among the humans is around 80%. He goes on to say that, interestingly, human-LLM agreement is also in the 80-85% range. I am really curious to find the literature behind this and learn more about it. Could someone point me to a specific paper (or papers) that supports this result?

Thank you,