(I need some advice.) I'm working on a RAG pipeline using GPT-4o-mini, and I've implemented a prompt-based guardrail system to verify whether generated answers are accurate, relevant, and fully grounded in the provided context.
The core idea is:
- For each `(user_question, context, generated_answer)` triplet, the guardrail checks:
  - Relevance to the question
  - Completeness of the answer
  - Factual accuracy, sentence by sentence
  - No hallucinations or invented facts
- The prompt outputs a binary pass/fail (1 or 0).

While this works in many cases, I'm observing significant false positives (hallucinated or incomplete answers passing) and false negatives (over-rejecting answers that are technically grounded but phrased differently).
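For context, here's a simplified sketch of the kind of guardrail call I mean (the prompt is abbreviated and the names are illustrative, using the standard `openai` Python client):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Abbreviated verifier prompt; the real prompt spells out each check in more detail.
VERIFIER_PROMPT = """You are a strict answer verifier. Given a question, a retrieved context,
and a generated answer, check that the answer is relevant to the question, complete,
factually accurate sentence by sentence, and contains no claims not grounded in the context.
Reply with a single character: 1 if the answer passes every check, 0 otherwise.

Question: {question}

Context: {context}

Answer: {answer}"""


def guardrail_pass(question: str, context: str, answer: str) -> bool:
    """Return True if gpt-4o-mini judges the answer to pass all checks."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content": VERIFIER_PROMPT.format(
            question=question, context=context, answer=answer)}],
    )
    return resp.choices[0].message.content.strip() == "1"
```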
My Question:
Would fine-tuning a model like BERT (or RoBERTa) to act as a binary verifier (given context, question, and answer) be a more reliable long-term solution? Or would you recommend a different approach — e.g., NLI-based sentence-level verification or chain-of-thought prompting to improve consistency?
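By NLI-based sentence-level verification I mean something like the sketch below: split the answer into sentences and check that each one is entailed by the retrieved context using an off-the-shelf MNLI cross-encoder (the checkpoint name, naive sentence splitting, and threshold are just placeholders):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Any MNLI-style cross-encoder would do; this checkpoint is just an example.
NLI_MODEL = "microsoft/deberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(NLI_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(NLI_MODEL)
model.eval()

# Find the "entailment" label index regardless of how the checkpoint cases its labels.
ENTAIL_IDX = next(i for i, name in model.config.id2label.items()
                  if "entail" in name.lower())


def sentence_entailed(context: str, sentence: str, threshold: float = 0.5) -> bool:
    """True if the NLI model says the context entails the sentence."""
    inputs = tokenizer(context, sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    return probs[ENTAIL_IDX].item() >= threshold


def answer_grounded(context: str, answer: str) -> bool:
    """Pass only if every sentence of the answer is entailed by the context."""
    sentences = [s.strip() for s in answer.split(".") if s.strip()]  # naive splitter
    return all(sentence_entailed(context, s) for s in sentences)
```

As I understand it, this would only cover the groundedness check; relevance and completeness would presumably still need something else.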
If fine-tuning is viable:
- What's a minimum dataset size you'd consider effective?
- Would a few thousand manually reviewed samples be enough to get decent performance?
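For the fine-tuning option, the setup I have in mind is roughly the sketch below: pack each (question, context, answer) triplet into one sequence and train a binary pass/fail classifier on manually reviewed labels (the file names and hyperparameters are placeholders):

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

MODEL_NAME = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Placeholder files; each line holds {"question": ..., "context": ..., "answer": ..., "label": 0 or 1}.
data = load_dataset("json", data_files={"train": "verifier_train.jsonl",
                                        "validation": "verifier_val.jsonl"})


def tokenize(batch):
    # Pack the triplet into a single input sequence for the classifier.
    text = [f"question: {q} answer: {a} context: {c}"
            for q, a, c in zip(batch["question"], batch["answer"], batch["context"])]
    return tokenizer(text, truncation=True, max_length=512)


data = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="answer-verifier",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
print(trainer.evaluate())  # reports eval loss by default; accuracy would need a compute_metrics fn
```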