Seeking Advice: LLM Struggles to Find Contradictions Across Multiple Documents

Hi everyone,

I’m hoping to get your thoughts on a problem I’ve encountered with a project.
A small part of our project involves checking one or more documents for inconsistencies. For example, if a document states that “milk tea costs $10 a cup” in one place, and the same or another document says, “spend $20 to buy a cup of milk tea,” that’s a contradiction. Because these conflicting statements aren’t always phrased identically, we need to use an LLM to understand the semantics to identify them.
We’ve been passing the documents directly to a model (like ChatGPT or Gemini) via their API for it to read and assess.
We’ve noticed that if there’s only one document to check, the results are pretty good. The model is generally able to find the contradictory points with reasonable accuracy.
However, when we increase the number of documents to two or three, the model’s performance drops significantly. The inconsistencies could be within the same document or across different ones. In these multi-document scenarios, the model seems to get confused and can’t reliably identify the contradictions.
Based on my description, does anyone have an idea what might be causing this issue? What steps could we take to enable the model to accurately find contradictions across several documents? (I’m wondering if, in this situation, instead of feeding the entire documents to the large model, we should be using a method similar to RAG by chunking and vectorizing the documents first. However, this is different from a typical RAG use case, as we’re not searching for specific information based on a user query.)
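Concretely, the chunk-and-compare idea I have in mind would look something like the sketch below. This is only a rough illustration of the approach, not our actual code; the chunk size, the paragraph-based splitting, and the final yes/no prompt are all placeholder choices.

```python
from itertools import combinations

def chunk_document(text, max_chars=1000):
    """Split a document into roughly fixed-size chunks on paragraph boundaries."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def candidate_pairs(documents):
    """Return every pair of chunks, both within and across documents."""
    all_chunks = []
    for doc_id, text in enumerate(documents):
        for chunk in chunk_document(text):
            all_chunks.append((doc_id, chunk))
    return list(combinations(all_chunks, 2))

docs = [
    "Milk tea costs $10 a cup.\n\nThe shop opens at 9am.",
    "You need to spend $20 to buy a cup of milk tea.",
]
pairs = candidate_pairs(docs)
# Each (doc_id, chunk) pair would then be sent to the LLM with a short
# prompt such as: "Do these two passages contradict each other?"
```

The downside is that the number of pairs grows quadratically with the number of chunks, so some pre-filtering (e.g., only comparing chunks whose embeddings are similar, since contradictions usually occur between passages about the same topic) would likely be needed for larger document sets.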
Thanks in advance for your help!

I'm no expert, but I'm replying to build my own understanding as well, with some assumptions based on my experience with ChatGPT.

a) First of all, with my API key on a free account I'm limited to gpt-4o-mini, which is possibly the only model I can use. My guess is that these smaller models have far fewer transformer layers, whereas the advanced models (the exact counts aren't disclosed) have many more, which may let them process an issue like this at a much finer level.

These smaller and older models also tend to truncate long inputs rather than process the full files. There's also an option to "think harder" (I believe Grok has this), though I'm not sure whether that just means a more detailed answer or actually processing the information at a deeper level.

If you can choose a more advanced (paid) model, you might get better responses.

b) Also, see if you can add a system prompt that frames the task. In my head this is a classification problem, and in one tutorial I saw that an encoder-style transformer is better suited for it.
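To make the classification framing concrete, a system prompt along the lines below might help. The wording and the three labels are just my guess at a reasonable setup, not a tested recipe:

```python
def build_messages(passage_a, passage_b):
    """Frame contradiction detection as a three-way classification task."""
    system = (
        "You are a fact-consistency checker. Given two passages, classify "
        "their relationship as exactly one of: CONTRADICTION, CONSISTENT, "
        "or UNRELATED. A CONTRADICTION means the two passages cannot both "
        "be true (e.g. different prices for the same item). Reply with the "
        "label only."
    )
    user = f"Passage A:\n{passage_a}\n\nPassage B:\n{passage_b}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

msgs = build_messages("Milk tea costs $10 a cup.",
                      "Spend $20 to buy a cup of milk tea.")
```

Asking for a single label keeps the output easy to parse and, I suspect, discourages the model from wandering off into free-form commentary.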

c) Also, set the temperature parameter if you can, and try a lower value when prompting (the ChatGPT API allows this). To my understanding, keeping the temperature between 0 and 0.3 makes the model less likely to guess or go off the rails.
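As a sketch of what I mean, the helper below just builds the request parameters with a low temperature enforced; the model name and prompt are only placeholders:

```python
def contradiction_request(model, messages, temperature=0.0):
    """Build chat-completion request params, enforcing a low temperature
    so the answer stays deterministic for a classification-style task."""
    if not 0.0 <= temperature <= 0.3:
        raise ValueError("keep temperature low for classification-style tasks")
    return {"model": model, "messages": messages, "temperature": temperature}

params = contradiction_request(
    "gpt-4o-mini",
    [{"role": "user", "content": "Do these two passages contradict each other?"}],
    temperature=0.2,
)
# With the official openai Python client, this would then be sent as:
# client.chat.completions.create(**params)
```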

I'd love to hear from the experts; please correct me if my understanding is off.