Guidance Needed: Improving Guardrail Evaluation in RAG System (GPT-4o-mini Use Case)

I need some advice. I’m working on a RAG pipeline using GPT-4o-mini, and I’ve implemented a prompt-based guardrail to verify that generated answers are accurate, relevant, and fully grounded in the provided context.

The core idea is:

  • For each (user_question, context, generated_answer) triplet, the guardrail checks:

    1. Relevance to the question

    2. Completeness of answer

    3. Factual accuracy sentence-by-sentence

    4. No hallucinations or invented facts

The prompt outputs a binary pass/fail verdict (1 or 0). This works in many cases, but I’m seeing a significant number of false positives (hallucinated or incomplete answers that pass) and false negatives (answers that are grounded in the context but phrased differently from it, which get over-rejected).
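
For reference, the check currently looks roughly like the sketch below (simplified; the exact prompt wording, the `GUARDRAIL_PROMPT` name, and the helper function are placeholders rather than my production code):

```python
from openai import OpenAI

client = OpenAI()

GUARDRAIL_PROMPT = """You are a strict answer verifier.
Given a question, a retrieved context, and a generated answer, check that the answer is:
1. relevant to the question,
2. complete,
3. factually accurate sentence by sentence against the context,
4. free of hallucinated or invented facts.
Reply with a single character: 1 if all checks pass, 0 otherwise."""


def guardrail_pass(question: str, context: str, answer: str) -> bool:
    """Return True if the generated answer passes the prompt-based guardrail."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # keep the verdict as deterministic as possible
        messages=[
            {"role": "system", "content": GUARDRAIL_PROMPT},
            {
                "role": "user",
                "content": f"Question:\n{question}\n\nContext:\n{context}\n\nAnswer:\n{answer}",
            },
        ],
    )
    return response.choices[0].message.content.strip() == "1"
```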

My Question:

Would fine-tuning a model like BERT or RoBERTa to act as a binary verifier over (context, question, answer) triplets be a more reliable long-term solution? Or would you recommend a different approach, e.g. NLI-based sentence-level verification or chain-of-thought prompting to improve consistency?
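
To make the NLI option concrete, this is the kind of sentence-level check I have in mind (a rough sketch only; the `roberta-large-mnli` checkpoint, the naive sentence splitting, and the 0.5 threshold are placeholder choices):

```python
from transformers import pipeline

# Off-the-shelf MNLI-style model; any NLI checkpoint could be swapped in here.
nli = pipeline("text-classification", model="roberta-large-mnli")


def sentence_entailed(context: str, sentence: str, threshold: float = 0.5) -> bool:
    """Check whether a single answer sentence is entailed by the retrieved context."""
    scores = nli({"text": context, "text_pair": sentence}, top_k=None)
    by_label = {s["label"]: s["score"] for s in scores}
    return by_label.get("ENTAILMENT", 0.0) >= threshold


def answer_grounded(context: str, answer: str) -> bool:
    """Pass only if every sentence of the answer is entailed by the context."""
    sentences = [s.strip() for s in answer.split(".") if s.strip()]  # naive splitter
    return all(sentence_entailed(context, s) for s in sentences)
```

The appeal here is that a faithful paraphrase should still be scored as entailed even when it isn't a verbatim match, which is exactly the false-negative case I'm hitting with the prompt-based check.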

If fine-tuning is viable (a rough sketch of the setup I have in mind is at the end of this post):

  • What’s a minimum dataset size you’d consider effective?

  • Would a few thousand manually reviewed samples be enough to get decent performance?
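
For reference, the fine-tuning setup I'm picturing is roughly the following (a sketch only: `roberta-base`, the CSV file names, the column schema, and the hyperparameters are all placeholders, not settled choices):

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Expected CSV columns (placeholder schema): question, context, answer, label (0 or 1).
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "val.csv"})


def encode(batch):
    # One simple packing choice: question + answer as segment A, retrieved context
    # as segment B. Truncation matters because contexts can exceed 512 tokens.
    qa = [q + " " + a for q, a in zip(batch["question"], batch["answer"])]
    return tokenizer(qa, batch["context"], truncation=True, max_length=512)


dataset = dataset.map(encode, batched=True)
dataset = dataset.rename_column("label", "labels")

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="answer-verifier",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    data_collator=DataCollatorWithPadding(tokenizer),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```

The cross-encoder framing (answer and context scored in the same forward pass) is what should let the model judge grounding rather than mere topical similarity, which is why I'm leaning toward it over an embedding-similarity check.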