Seeking Advice: LLM Struggles to Find Contradictions Across Multiple Documents

Hi everyone,

I’m hoping to get your thoughts on a problem I’ve encountered with a project.
A small part of our project involves checking one or more documents for inconsistencies. For example, if a document states that “milk tea costs $10 a cup” in one place, and the same or another document says, “spend $20 to buy a cup of milk tea,” that’s a contradiction. Because these conflicting statements aren’t always phrased identically, we need to use an LLM to understand the semantics to identify them.
We’ve been passing the documents directly to a model (like ChatGPT or Gemini) via their API for it to read and assess.
We’ve noticed that if there’s only one document to check, the results are pretty good. The model is generally able to find the contradictory points with reasonable accuracy.
However, when we increase the number of documents to two or three, the model’s performance drops significantly. The inconsistencies could be within the same document or across different ones. In these multi-document scenarios, the model seems to get confused and can’t reliably identify the contradictions.
Based on my description, does anyone have an idea what might be causing this issue? What steps could we take to enable the model to accurately find contradictions across several documents? (I’m wondering if, in this situation, instead of feeding the entire documents to the large model, we should be using a method similar to RAG by chunking and vectorizing the documents first. However, this is different from a typical RAG use case, as we’re not searching for specific information based on a user query.)
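Concretely, the chunk-and-compare idea I have in mind would look something like the sketch below. This is only a rough illustration of the approach, not our actual code; the chunk size, the paragraph-based splitting, and the final yes/no prompt are all placeholder choices.

```python
from itertools import combinations

def chunk_document(text, max_chars=1000):
    """Split a document into roughly fixed-size chunks on paragraph boundaries."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def candidate_pairs(documents):
    """Return every pair of chunks, both within and across documents."""
    all_chunks = []
    for doc_id, text in enumerate(documents):
        for chunk in chunk_document(text):
            all_chunks.append((doc_id, chunk))
    return list(combinations(all_chunks, 2))

docs = [
    "Milk tea costs $10 a cup.\n\nThe shop opens at 9am.",
    "You need to spend $20 to buy a cup of milk tea.",
]
pairs = candidate_pairs(docs)
# Each (doc_id, chunk) pair would then be sent to the LLM with a short
# prompt such as: "Do these two passages contradict each other?"
```

The downside is that the number of pairs grows quadratically with the number of chunks, so some pre-filtering (e.g., only comparing chunks whose embeddings are similar, since contradictions usually occur between passages about the same topic) would likely be needed for larger document sets.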
Thanks in advance for your help!

I'm no expert, but I'm replying to build my own understanding as well, with some assumptions based on my experience with ChatGPT.

a) First of all, with my API key on a free account I'm limited to gpt-4o-mini, which is possibly the only model I can use. My guess is that these smaller models have far fewer transformer layers, whereas the advanced models (the exact counts aren't disclosed) have many more, which may let them process an issue like this at a much finer level.

These smaller and older models also tend to truncate long inputs rather than process the full files. There's also an option to "think harder" (I believe Grok has this), though I'm not sure whether that just means a more detailed answer or actually processing the information at a deeper level.

If you can choose a more advanced (paid) model, you might get better responses.

b) Also, see if you can add a system prompt that frames the task. In my head this is a classification problem, and in one tutorial I saw that an encoder-style transformer is better suited for it.
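To make the classification framing concrete, a system prompt along the lines below might help. The wording and the three labels are just my guess at a reasonable setup, not a tested recipe:

```python
def build_messages(passage_a, passage_b):
    """Frame contradiction detection as a three-way classification task."""
    system = (
        "You are a fact-consistency checker. Given two passages, classify "
        "their relationship as exactly one of: CONTRADICTION, CONSISTENT, "
        "or UNRELATED. A CONTRADICTION means the two passages cannot both "
        "be true (e.g. different prices for the same item). Reply with the "
        "label only."
    )
    user = f"Passage A:\n{passage_a}\n\nPassage B:\n{passage_b}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

msgs = build_messages("Milk tea costs $10 a cup.",
                      "Spend $20 to buy a cup of milk tea.")
```

Asking for a single label keeps the output easy to parse and, I suspect, discourages the model from wandering off into free-form commentary.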

c) Also, set the temperature parameter if you can, and try a lower value when prompting (the ChatGPT API allows this). To my understanding, keeping the temperature between 0 and 0.3 makes the model less likely to guess or go off the rails.
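As a sketch of what I mean, the helper below just builds the request parameters with a low temperature enforced; the model name and prompt are only placeholders:

```python
def contradiction_request(model, messages, temperature=0.0):
    """Build chat-completion request params, enforcing a low temperature
    so the answer stays deterministic for a classification-style task."""
    if not 0.0 <= temperature <= 0.3:
        raise ValueError("keep temperature low for classification-style tasks")
    return {"model": model, "messages": messages, "temperature": temperature}

params = contradiction_request(
    "gpt-4o-mini",
    [{"role": "user", "content": "Do these two passages contradict each other?"}],
    temperature=0.2,
)
# With the official openai Python client, this would then be sent as:
# client.chat.completions.create(**params)
```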

I'd love to hear from the experts; please correct me if my understanding is off.