How to evaluate Agents

Hi there,

We have been implementing and learning about agents with AutoGen, LangGraph, CrewAI, and plain LangChain. We know how to evaluate RAG, but how can we evaluate agents? Tool selection, answer correctness, etc… Any ideas, resources, or trainings to explore? When building GenAI solutions with tool-using agents, I think we need a set of different tools or solutions to test both the individual components and the system end to end, don’t you?

Thank you!

I’m not an expert here, but my approach would be:

  • You have a list of test cases - these could be tasks you want your agent to be able to accomplish
  • You run your agent script on each test case. At first, you can manually inspect the results and judge how well it accomplished the task, either with a score or a binary pass/fail. You can also automate the evaluation with another LLM call that looks at the result and compares it to what you expect the result to be (in my experience, LLMs do pretty well at this kind of grading).
  • You could calculate the average score or a percent success rate, and maybe you want your agent to keep above some threshold on your test cases as you iterate.
  • If you’re automating, you might use something like pytest with parameterized tests
  • You may want to mock out any tools you’re using - for example, if one of your tools is a web search, you may want to return a static result rather than having the tool query the web fresh each time. Otherwise it might be trickier to debug why your agent is returning different answers.
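Putting those steps together, here's a minimal sketch of the loop in plain Python. Everything here is a placeholder: `run_agent` stands in for your real agent, `fake_web_search` is the mocked tool, and `llm_judge` uses a simple containment check where you would actually make an LLM call to grade the answer.

```python
def fake_web_search(query: str) -> str:
    """Mocked tool: returns a canned result instead of hitting the web,
    so agent runs are reproducible and easier to debug."""
    return "Paris is the capital of France."

def run_agent(task: str, search_tool) -> str:
    # Stand-in for your real agent loop; here it just returns the tool output.
    return search_tool(task)

def llm_judge(task: str, answer: str, expected: str) -> bool:
    # In practice this would be another LLM call comparing the answer to
    # what you expect; a containment check stands in for it here.
    return expected.lower() in answer.lower()

# Test cases: (task, expected result) pairs.
TEST_CASES = [
    ("What is the capital of France?", "Paris"),
]

def success_rate(cases, agent, judge, tool) -> float:
    """Run every test case through the agent and return the pass rate."""
    passed = sum(judge(task, agent(task, tool), expected)
                 for task, expected in cases)
    return passed / len(cases)

if __name__ == "__main__":
    rate = success_rate(TEST_CASES, run_agent, llm_judge, fake_web_search)
    # Gate iteration on a threshold, as described above.
    assert rate >= 0.8, f"success rate {rate:.0%} below threshold"
```

From here it's a small step to wrap each test case in `pytest.mark.parametrize` so every task shows up as its own test in your CI output.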

Hi Edu4rd!

LangGraph certainly looks as though it has some good ways to check how your agents are tracking, but if you’re looking for something more in-depth, you might want to give LangSmith a try. I’ve used it at various points to evaluate LangChain-based AI tools, so it should be a good starting point for agentic frameworks built on LangChain.

Otherwise I’d probably just build a separate evaluation agent team. :slight_smile: