How to evaluate Agents

Hi there,

We have been implementing and learning about agents with AutoGen, LangGraph, CrewAI, or just LangChain. We also know how to evaluate RAG, but how can we evaluate agents: selecting the right tools, answer correctness, etc.? Any ideas, resources, or trainings to explore? When building GenAI solutions with tool-using agents, I think we need a set of different tools or solutions to test both the individual components and the end-to-end flow, don’t you?

Thank you!
Eduard

I’m not an expert here, but my approach would be:

  • You have a list of test cases - these could be tasks you want your agent to be able to accomplish
  • You run your agent script on each test case. At first, you can manually inspect the results and judge how well it accomplished the task, giving it a score or a binary pass/fail. You can also automate the evaluation with another LLM call that looks at the result and compares it to what you expect (in my experience, LLMs do pretty well at this kind of evaluation).
  • You could calculate the average score or a percent success rate, and maybe require your agent to stay above some threshold on your test cases as you iterate.
  • If you’re automating, you could use something like pytest with parameterized tests (there’s a rough sketch after this list).
  • You may want to mock out any tools you’re using - for example, if one of your tools is a web search, return a static result rather than having the tool query the web fresh each time (second sketch below). Otherwise it can be tricky to debug why your agent returns different answers from run to run.
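To make that concrete, here’s a rough sketch of how the pieces above could fit together with pytest. Everything in it is an assumption on my part: `run_agent` stands in for however you invoke your agent (AutoGen, LangGraph, CrewAI, ...), the test cases and the 0.7 threshold are made up, and I’m using OpenAI’s chat API as the judge purely as an example.

```python
# test_agent.py -- illustrative sketch only, not tied to any specific framework.
import pytest
from openai import OpenAI

judge_client = OpenAI()  # the "judge" LLM; swap in whatever provider you use

# (task given to the agent, what a good result should look like)
TEST_CASES = [
    ("Find the release year of Python 3.12 and report it.",
     "States that Python 3.12 was released in 2023."),
    ("List three pros and cons of using a vector database for FAQ search.",
     "Gives at least three pros and three cons, each plausible."),
]

PASS_THRESHOLD = 0.7  # arbitrary bar; tighten it as the agent improves


def run_agent(task: str) -> str:
    """Placeholder: call your actual agent here and return its final answer."""
    raise NotImplementedError("plug in your agent invocation")


def llm_judge(task: str, expectation: str, answer: str) -> float:
    """Ask another LLM to grade the answer against the expectation, from 0 to 1."""
    prompt = (
        f"Task given to an AI agent:\n{task}\n\n"
        f"What a good answer should do:\n{expectation}\n\n"
        f"The agent's answer:\n{answer}\n\n"
        "On a scale from 0 to 1, how well does the answer meet the expectation? "
        "Reply with only the number."
    )
    resp = judge_client.chat.completions.create(
        model="gpt-4o-mini",  # example judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return float(resp.choices[0].message.content.strip())


@pytest.mark.parametrize("task,expectation", TEST_CASES)
def test_agent_accomplishes_task(task, expectation):
    answer = run_agent(task)
    score = llm_judge(task, expectation, answer)
    assert score >= PASS_THRESHOLD, f"judge scored {score:.2f} for task: {task}"
```

pytest’s summary then gives you a pass rate across the test cases for free, which you can track against a success-rate threshold as you iterate.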
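And for the mocking point, a minimal sketch using pytest’s built-in `monkeypatch` fixture. `my_agent.tools` and `web_search` are hypothetical names standing in for wherever your search tool actually lives:

```python
# Pin the web-search tool to a canned result so every run sees identical inputs.
import my_agent.tools  # hypothetical module exposing the agent's tools


def fake_web_search(query: str) -> str:
    """Static stand-in for the real search tool."""
    return "Python 3.12 was released on October 2, 2023."


def test_agent_with_canned_search(monkeypatch):
    # monkeypatch reverts the patch automatically after the test finishes
    monkeypatch.setattr(my_agent.tools, "web_search", fake_web_search)
    answer = run_agent("When was Python 3.12 released?")  # same run_agent as above
    assert "2023" in answer
```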

Hi Edu4rd!

LangGraph certainly looks as though it has some great ways to evaluate how your agents are tracking, but if you’re looking for something more in-depth, you might want to give LangSmith a try. I’ve used it at various points for evaluating LangChain-based AI tools, so for agentic frameworks built on top of LangChain it should be a good starting point.
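In case it helps, an eval run with the LangSmith SDK looks roughly like this (written from memory, so double-check the current docs for exact signatures; `my_agent` and the dataset name are placeholders):

```python
# Rough shape of a LangSmith experiment: dataset + target function + evaluator.
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()  # expects your LangSmith API key in the environment

# 1. Create a dataset of inputs and reference outputs (one-off setup)
dataset = client.create_dataset("agent-eval-demo")
client.create_examples(
    inputs=[{"question": "When was Python 3.12 released?"}],
    outputs=[{"answer": "October 2, 2023"}],
    dataset_id=dataset.id,
)

# 2. An evaluator that scores each run against its reference example
def correctness(run, example):
    got = (run.outputs or {}).get("output", "")
    want = example.outputs["answer"]
    return {"key": "correctness", "score": float(want.lower() in got.lower())}

# 3. Run the agent over the dataset and log the experiment to LangSmith
results = evaluate(
    lambda inputs: {"output": my_agent(inputs["question"])},  # your agent call
    data="agent-eval-demo",
    evaluators=[correctness],
    experiment_prefix="baseline",
)
```

The results then show up in the LangSmith UI, so you can compare experiments as you change prompts or tools.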

Otherwise I’d probably just build a separate evaluation agent team. :slight_smile: