How to Evaluate a Self-Evaluating RAG System using DeepEval / RAGAS with a Golden Dataset?

Dhruv_06 · April 21, 2026, 11:13pm

Hi everyone,

I’ve built a Self-Evaluating RAG System using LangChain, ChromaDB, BM25 hybrid retrieval, query rewriting, and cross-encoder reranking, with LLaMA 3.3 70B via Groq as the LLM.

I’m trying to evaluate it using DeepEval with a golden dataset (QA pairs generated from my documents). Here’s my current setup:

Stack:

RAG: LangChain + ChromaDB + BM25 + CrossEncoder reranker
LLM: Groq (LLaMA 3.3 70B)
Evaluation framework: DeepEval v3.9.7
Custom LLM for evaluation: Gemini 1.5 Flash (via Google GenAI SDK)
Golden dataset: Generated using DeepEval Synthesizer from 25 Wikipedia .txt documents
Metrics: Answer Relevancy, Faithfulness, Contextual Precision, Contextual Recall

What I’ve done so far:

Generated a golden dataset using DeepEval’s Synthesizer with a custom HuggingFace embedder and Gemini as the critic model
Built test cases by running each golden question through my actual RAG pipeline to get the real actual_output and retrieval_context
Running the evaluation using DeepEval’s evaluate() function with a custom Gemini model
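
For reference, the test-case building step can be sketched roughly as below. This is a minimal, library-free sketch, not the exact code from my project: GoldenPair, build_records, and the rag_pipeline callable are hypothetical names used only for illustration. The dictionary keys mirror the LLMTestCase fields (input, actual_output, expected_output, retrieval_context), so each record could be passed on as LLMTestCase(**record).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class GoldenPair:
    """Hypothetical stand-in for one golden: a question and its reference answer."""
    input: str
    expected_output: str


def build_records(
    goldens: List[GoldenPair],
    rag_pipeline: Callable[[str], Tuple[str, List[str]]],
) -> List[Dict[str, object]]:
    """Run every golden question through the pipeline and collect the results.

    rag_pipeline is assumed to return (generated_answer, retrieved_chunks).
    """
    records = []
    for golden in goldens:
        answer, chunks = rag_pipeline(golden.input)
        records.append(
            {
                "input": golden.input,
                "actual_output": answer,  # what the RAG system actually said
                "expected_output": golden.expected_output,  # reference answer
                "retrieval_context": list(chunks),  # chunks actually retrieved
            }
        )
    return records
```

Keeping this step separate from scoring means the pipeline outputs can be cached to disk once and re-scored later without calling Groq again.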

Problems I’m facing:

DeepEval’s evaluate() times out after ~30 minutes when running 8 test cases in parallel with 4 metrics
Getting occasional 500 Internal errors from the Gemini API during evaluation
Not sure whether running the metrics one by one with metric.measure() instead of evaluate() is the right approach
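
To make the one-by-one idea concrete, here is a minimal sketch of what I mean, with a retry around each call to absorb the occasional 500. It assumes only that each metric exposes measure(test_case) and stores the result in .score, which is how the DeepEval metrics I use behave. The helper names (measure_with_retry, run_sequentially), the retry count, and the delays are my own assumptions, not DeepEval settings.

```python
import time
from typing import Any, Callable, Dict, List, Optional


def measure_with_retry(
    metric: Any,
    test_case: Any,
    max_attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Call metric.measure(test_case), retrying with exponential backoff.

    Returns metric.score on success, or None once every attempt has failed.
    Note: this retries on any exception, not only on HTTP 500 responses.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            metric.measure(test_case)
            return metric.score
        except Exception:
            if attempt == max_attempts:
                return None
            # Wait 2s, 4s, 8s, ... (with the default base_delay) before retrying.
            sleep(base_delay * 2 ** (attempt - 1))
    return None


def run_sequentially(
    test_cases: List[Any],
    metrics: List[Any],
    **retry_kwargs: Any,
) -> List[Dict[str, Optional[float]]]:
    """Score each test case with each metric, one judge call at a time."""
    results = []
    for case in test_cases:
        row = {}
        for metric in metrics:
            row[type(metric).__name__] = measure_with_retry(metric, case, **retry_kwargs)
        results.append(row)
    return results
```

This trades speed for stability: only one judge request is in flight at a time, and a failed metric shows up as None in the results instead of aborting the whole run.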

Questions:

What’s the recommended way to run a DeepEval evaluation without hitting timeouts? Should I call metric.measure() one by one, or is there a way to configure the timeout in evaluate()?
Is there a better open-source/free LLM choice for the evaluation judge model that’s more stable than Gemini for DeepEval metrics?
Has anyone successfully used RAGAS instead of DeepEval for a similar setup? Would it be easier to integrate?
Any tips on generating better quality golden datasets without using OpenAI (since I don’t have an OpenAI key)?

Any help would be appreciated. Thanks!

omarWael · April 21, 2026, 11:52pm

That is a great project, Dhruv. Really cool work.

One of the specialized mentors here should be able to help you soon with the technical details and the errors you are getting, so don’t worry.

Good luck with it!
