How to Evaluate a Self-Evaluating RAG System using DeepEval / RAGAS with a Golden Dataset?

Hi everyone,

I’ve built a Self-Evaluating RAG System using LangChain, ChromaDB, BM25 hybrid retrieval, query rewriting, and cross-encoder reranking, with LLaMA 3.3 70B via Groq as the LLM.

I’m trying to evaluate it using DeepEval with a golden dataset (QA pairs generated from my documents). Here’s my current setup:

Stack:

  • RAG: LangChain + ChromaDB + BM25 + CrossEncoder reranker
  • LLM: Groq (LLaMA 3.3 70B)
  • Evaluation framework: DeepEval v3.9.7
  • Custom LLM for evaluation: Gemini 1.5 Flash (via Google GenAI SDK)
  • Golden dataset: Generated using DeepEval Synthesizer from 25 Wikipedia .txt documents
  • Metrics: Answer Relevancy, Faithfulness, Contextual Precision, Contextual Recall

What I’ve done so far:

  1. Generated a golden dataset using DeepEval’s Synthesizer with a custom HuggingFace embedder and Gemini as the critic model
  2. Built test cases by running each golden question through my actual RAG pipeline to get real actual_output and retrieval_context
  3. Running evaluation using DeepEval’s evaluate() function with a custom Gemini model
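
To make step 2 concrete, this is roughly the shape of the loop, as a minimal sketch rather than my exact code. `Golden`, `run_rag`, and `build_test_cases` are placeholder names for illustration; `run_rag` stands in for the pipeline and is assumed to return the answer plus the retrieved chunks. The dict keys mirror the LLMTestCase fields (input, actual_output, expected_output, retrieval_context).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Golden:
    """Placeholder for one golden QA pair (hypothetical, for illustration)."""
    input: str
    expected_output: str


def build_test_cases(
    goldens: List[Golden],
    run_rag: Callable[[str], Tuple[str, List[str]]],
) -> List[Dict[str, object]]:
    """Run each golden question through the pipeline and collect the fields
    an evaluation test case needs."""
    cases = []
    for golden in goldens:
        # run_rag is assumed to return (generated answer, retrieved chunks).
        answer, chunks = run_rag(golden.input)
        cases.append(
            {
                "input": golden.input,
                "actual_output": answer,
                "expected_output": golden.expected_output,
                "retrieval_context": list(chunks),
            }
        )
    return cases
```

Each dict can then be unpacked into a test case, so the pipeline run stays separate from the (slow) judging step.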

Problems I’m facing:

  1. DeepEval’s evaluate() times out after ~30 minutes when running 8 test cases in parallel with 4 metrics
  2. Getting occasional 500 Internal Server Error responses from the Gemini API during evaluation
  3. Not sure if running evaluation one-by-one using metric.measure() instead of evaluate() is the right approach
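
For problem 3, the one-by-one approach I'm considering looks roughly like the sketch below. It assumes each metric exposes measure(test_case) and a score attribute; the helper names, retry count, and delays are my own placeholders, and catching bare Exception is a simplification (ideally only the transient 500 errors would be retried).

```python
import time


def measure_with_retry(metric, test_case, max_attempts=4, base_delay=2.0,
                       sleep=time.sleep):
    """Call metric.measure() and retry with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            metric.measure(test_case)
            return metric.score
        except Exception:
            # Assumption: any exception is treated as transient here.
            if attempt == max_attempts:
                raise
            # Waits base_delay, then 2x, then 4x, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))


def evaluate_sequentially(test_cases, metrics, **retry_kwargs):
    """Score every test case with every metric, one call at a time."""
    results = []
    for case in test_cases:
        row = {}
        for metric in metrics:
            name = type(metric).__name__
            try:
                row[name] = measure_with_retry(metric, case, **retry_kwargs)
            except Exception:
                # Record the failure and move on so one bad call
                # does not abort the whole run.
                row[name] = None
        results.append(row)
    return results
```

This trades speed for stability: only one judge call is in flight at a time, and a single failed call costs one score rather than the whole run.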

Questions:

  1. What’s the recommended way to run DeepEval evaluation without hitting timeouts — should I use metric.measure() one by one or is there a way to configure the timeout in evaluate()?
  2. Is there a better open-source/free LLM choice for the evaluation judge model that’s more stable than Gemini for DeepEval metrics?
  3. Has anyone successfully used RAGAS instead of DeepEval for a similar setup? Would it be easier to integrate?
  4. Any tips on generating better quality golden datasets without using OpenAI (since I don’t have an OpenAI key)?

Any help would be appreciated. Thanks!

That is a great project, Dhruv. Really cool work.

One of the specialized mentors here will help you soon with the technical details and the errors you are getting, so don't worry.

Good luck with it! :blush: