Hi everyone,
I’ve built a self-evaluating RAG system using LangChain, ChromaDB, BM25 hybrid retrieval, query rewriting, and cross-encoder reranking, with LLaMA 3.3 70B via Groq as the LLM.
I’m trying to evaluate it using DeepEval with a golden dataset (QA pairs generated from my documents). Here’s my current setup:
Stack:
- RAG: LangChain + ChromaDB + BM25 + CrossEncoder reranker
- LLM: Groq (LLaMA 3.3 70B)
- Evaluation framework: DeepEval v3.9.7
- Custom LLM for evaluation: Gemini 1.5 Flash (via Google GenAI SDK)
- Golden dataset: Generated using DeepEval Synthesizer from 25 Wikipedia .txt documents
- Metrics: Answer Relevancy, Faithfulness, Contextual Precision, Contextual Recall
What I’ve done so far:
- Generated a golden dataset using DeepEval’s Synthesizer with a custom HuggingFace embedder and Gemini as critic model
- Built test cases by running each golden question through my actual RAG pipeline to capture its actual_output and retrieval_context
- Ran the evaluation with DeepEval’s evaluate() function, using the custom Gemini model as the judge
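
For reference, here is a minimal sketch of the test-case building step, not my exact code. It assumes a hypothetical run_rag function that takes a question and returns the answer plus the retrieved chunks, and it represents goldens as plain dicts for illustration. The key names are chosen to match the LLMTestCase fields (input, actual_output, expected_output, retrieval_context).

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical pipeline signature: question -> (answer, retrieved chunks).
RagFn = Callable[[str], Tuple[str, List[str]]]


def build_test_case_rows(goldens: List[Dict[str, str]], run_rag: RagFn) -> List[dict]:
    """Run every golden question through the pipeline and collect one row per question."""
    rows = []
    for golden in goldens:
        answer, chunks = run_rag(golden["input"])
        rows.append(
            {
                "input": golden["input"],
                "actual_output": answer,
                "expected_output": golden["expected_output"],
                "retrieval_context": list(chunks),
            }
        )
    return rows


def fake_rag(question: str) -> Tuple[str, List[str]]:
    """Stand-in for the real pipeline so the sketch runs on its own."""
    return f"answer to: {question}", ["chunk A", "chunk B"]


goldens = [{"input": "Who wrote Hamlet?", "expected_output": "William Shakespeare"}]
rows = build_test_case_rows(goldens, fake_rag)
print(rows[0])
```

Each row can then be unpacked into a test case, e.g. LLMTestCase(**row), before it is handed to the metrics.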
Problems I’m facing:
- DeepEval’s evaluate() times out after ~30 minutes when running 8 test cases in parallel with 4 metrics
- Getting occasional HTTP 500 (Internal Server Error) responses from the Gemini API during evaluation
- Not sure if running evaluation one-by-one using metric.measure() instead of evaluate() is the right approach
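
To make the one-by-one idea concrete, here is a rough sketch of a retry wrapper with exponential backoff that could sit around each metric.measure() call so a transient 500 does not abort the whole run. This is an assumption about how the workaround might look, not something DeepEval provides: call_with_retry and its parameters are hypothetical names, and flaky_measure only simulates a judge call that fails twice.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 2.0,
    max_delay: float = 60.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on failure wait base_delay * 2**(attempt - 1) seconds (capped) and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter so parallel callers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")


# Simulated judge call: fails twice with a 500-style error, then returns a score.
calls = {"n": 0}


def flaky_measure() -> float:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("500 Internal Server Error")
    return 0.9


delays = []  # record the waits instead of actually sleeping
score = call_with_retry(flaky_measure, base_delay=1.0, sleep=delays.append)
print(score, calls["n"], len(delays))
```

In the real loop, fn would be something like lambda: metric.measure(test_case), called for each metric and test case in turn. That trades parallelism for predictability, so the total run time grows with the number of test cases and metrics.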
Questions:
- What’s the recommended way to run a DeepEval evaluation without hitting timeouts? Should I call metric.measure() on each test case one by one, or is there a way to configure the timeout in evaluate()?
- Is there a better open-source/free LLM choice for the evaluation judge model that’s more stable than Gemini for DeepEval metrics?
- Has anyone successfully used RAGAS instead of DeepEval for a similar setup? Would it be easier to integrate?
- Any tips on generating better quality golden datasets without using OpenAI (since I don’t have an OpenAI key)?
Any help would be appreciated. Thanks!