docAnalyzer - chat with large PDF datasets

Hi Everyone,

I am building docAnalyzer.AI using the classic combination of RAG and the OpenAI API. We
differentiate ourselves by letting users chat with a large dataset of documents all at once.

Happy to discuss anything related to this and advanced RAG techniques.


Our stack: JS agents & pipelines + Redis for vectors + Svelte for the front end

Let me know how it goes and what the key challenges are. I would be keen to hear, as we are looking at a similar solution.

There are obviously many challenges. Here are a few off the top of my head:

A) Data extraction/storage: even without taking OCR into account, PDFs contain unstructured data, so it’s a challenge to preserve and store important meta-information (e.g. TOC, tabular structure, etc.). We use knowledge graphs with Markdown markup as the low-level data format.

B) Contextual user prompt interpretation and vector search query builder: building the most effective query to retrieve the extracts that will let the LLM answer the prompt is much harder than people suspect. You need to take into account the chat history and the type of question (depending on that, the pipeline that leads to a good answer is different). We have several steps and routing strategies in place, some of which make OpenAI calls (which costs speed), and there is still lots of room for improvement in this domain.

C) Vector search: this is challenging to operate at scale.

D) Postprocessing the final LLM answer: this part is a bit easier (e.g. changing page number links), but it’s hard to close all potential bugs since OpenAI output is rather unpredictable, with lots of edge cases.
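
To make point B more concrete, here is a minimal sketch of a rule-based router and query builder. It is not docAnalyzer's actual pipeline: the route names, the regular expressions, and the helpers `classifyQuestion` and `buildSearchQuery` are all hypothetical, and a real system may well use an LLM call for either step.

```typescript
// Hypothetical sketch: route names and heuristics are illustrative assumptions.

interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

type Route = "summary" | "follow_up" | "lookup";

// Cheap rule-based routing, so that simple cases avoid an extra LLM round trip.
function classifyQuestion(prompt: string): Route {
  const text = prompt.toLowerCase();
  if (/\b(summar(y|ize|ise)|overview)\b/.test(text)) return "summary";
  // Pronouns that probably refer back to an earlier turn of the chat.
  if (/\b(it|this|that|they|them|those)\b/.test(text)) return "follow_up";
  return "lookup";
}

// Builds the text that would be embedded and sent to the vector search.
function buildSearchQuery(prompt: string, history: ChatMessage[]): string {
  const current = prompt.trim();
  if (classifyQuestion(prompt) !== "follow_up") return current;

  // For follow-ups, prepend the most recent user turn so pronouns keep a referent.
  for (let i = history.length - 1; i >= 0; i--) {
    const message = history[i];
    if (message && message.role === "user") {
      return message.content.trim() + " " + current;
    }
  }
  return current;
}
```

With a history whose last user turn is "What is the termination clause?", the follow-up "Does it apply to contractors?" is expanded into a single query containing both sentences, while a standalone question is passed through unchanged. The trade-off is the one mentioned in B: heuristics like these are fast but brittle, whereas an LLM-based rewrite is more robust but adds latency.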


Thanks. Is there a particular library you use for building and formatting knowledge graphs? For us, that needs to be solved first, to create a sort of knowledge base on which to build RAG.

We don’t use any framework or library for the overall architecture (I mean no LLM-specific library, of course). The libraries I have seen use the wrong kind of abstraction and just make things harder, IMHO. A knowledge graph is just a data structure you can build with low-level tools (in our case Redis & JSON).
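
As a rough illustration of "just a data structure", here is a minimal sketch of a document graph stored as JSON strings under string keys. The schema (`GraphNode`, `GraphEdge`), the key prefixes `node:` and `edges:`, and the `GraphStore` class are assumptions made for this example, not docAnalyzer's actual format. A plain in-memory object stands in for Redis here; a Redis client would hold the same JSON strings under the same keys.

```typescript
// Hypothetical sketch: schema and key naming are illustrative assumptions.

interface GraphNode {
  id: string;
  kind: "document" | "section" | "table" | "paragraph";
  markdown: string; // content kept as Markdown so headings and tables survive
  page?: number;
}

interface GraphEdge {
  from: string;
  to: string;
  relation: "contains" | "next";
}

// In-memory stand-in for a key-value store holding JSON strings.
class GraphStore {
  private kv: Record<string, string> = {};

  putNode(node: GraphNode): void {
    this.kv["node:" + node.id] = JSON.stringify(node);
  }

  // Outgoing edges are stored as one JSON array per source node.
  putEdge(edge: GraphEdge): void {
    const key = "edges:" + edge.from;
    const edges: GraphEdge[] = JSON.parse(this.kv[key] || "[]");
    edges.push(edge);
    this.kv[key] = JSON.stringify(edges);
  }

  getNode(id: string): GraphNode | undefined {
    const raw = this.kv["node:" + id];
    return raw === undefined ? undefined : (JSON.parse(raw) as GraphNode);
  }

  // Returns the nodes directly contained in the given node.
  children(id: string): GraphNode[] {
    const edges: GraphEdge[] = JSON.parse(this.kv["edges:" + id] || "[]");
    const result: GraphNode[] = [];
    for (const edge of edges) {
      const node = edge.relation === "contains" ? this.getNode(edge.to) : undefined;
      if (node) result.push(node);
    }
    return result;
  }
}

const store = new GraphStore();
store.putNode({ id: "doc1", kind: "document", markdown: "# Contract" });
store.putNode({ id: "doc1/s1", kind: "section", markdown: "## 1. Termination", page: 4 });
store.putNode({
  id: "doc1/s1/t1",
  kind: "table",
  markdown: "| Party | Notice |\n| --- | --- |\n| Supplier | 30 days |",
  page: 4,
});
store.putEdge({ from: "doc1", to: "doc1/s1", relation: "contains" });
store.putEdge({ from: "doc1/s1", to: "doc1/s1/t1", relation: "contains" });
```

Keeping the table as a Markdown string inside its own node is one way to preserve tabular structure and page numbers next to the text, so that a retrieved extract can be handed to the LLM together with its parent section.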