docAnalyzer - chat with large PDF dataset

clb · February 2, 2024, 11:07am

Hi Everyone,

I am building docAnalyzer.AI using the classic combination of RAG and the OpenAI API. We differentiate
ourselves by letting users chat with a large dataset of documents at the same time.

Happy to discuss anything related to this and advanced RAG techniques.

Christophe.

Our stack: JS agents & pipelines + Redis for vectors + Svelte for the front-end

sreenireddy · February 7, 2024, 9:24pm

Let me know how it goes and what the key challenges are. I would be keen to hear, as we are looking at a similar solution.

clb · February 8, 2024, 5:47am

There are obviously many challenges. Here are a few off the top of my head:

A) Data extraction/storage: even without taking OCR into account, PDF data is unstructured, so it’s a challenge to keep and store important meta information (e.g. TOC, tabular structure, etc.). We use knowledge graphs with MD markup as the low-level data format.

B) Contextual user prompt interpretation and vector search query building: finding the most efficient query to retrieve the relevant extracts that will let the LLM answer the prompt is much harder than people suspect. You need to take into account the chat history and the type of question (depending on that, the pipeline to success is different). We have several steps and a routing strategy in place, some of which make OpenAI calls (which is a penalty for speed), and there is still lots of room for improvement in this domain.

C) Vector search: this is operationally challenging to do at scale.

D) Postprocessing the final LLM answer: this part is a bit easier (e.g. changing page number links), but it’s hard to close all potential bugs since OpenAI output is rather unpredictable, with lots of edge cases.
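
To make point B a bit more concrete, here is a minimal sketch of the "classify the question, then build the search query" idea. It is not docAnalyzer's actual pipeline: it is written in Python for brevity (the stack above is JS), and the route names, the marker word lists and the helpers `classify_question` and `build_search_query` are all hypothetical. A real system would likely replace the keyword heuristics with an LLM call, which is where the speed penalty mentioned above comes in.

```python
import re
from dataclasses import dataclass


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str


# Hypothetical keyword heuristics, chosen only for illustration.
FOLLOW_UP_MARKERS = {"it", "this", "that", "they", "those", "these"}
SUMMARY_MARKERS = {"summarize", "summary", "overview"}


def classify_question(prompt: str) -> str:
    """Route a prompt to one of three illustrative pipelines."""
    words = re.findall(r"[a-z]+", prompt.lower())
    if any(w in SUMMARY_MARKERS for w in words):
        return "summary"
    if any(w in FOLLOW_UP_MARKERS for w in words):
        return "follow_up"
    return "lookup"


def build_search_query(prompt: str, history: list[Turn]) -> str:
    """Return the text to embed for vector search.

    Follow-up questions are prefixed with the previous user message so
    that pronouns like "it" have something to refer to.
    """
    if classify_question(prompt) == "follow_up":
        previous = [t.text for t in history if t.role == "user"]
        if previous:
            return f"{previous[-1]} {prompt}"
    return prompt


history = [
    Turn("user", "What is the warranty period in contract A?"),
    Turn("assistant", "Two years."),
]
query = build_search_query("Does it cover water damage?", history)
```

Even in this toy version, the query sent to the vector store differs from what the user typed, and the right transformation depends on the type of question.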

C.

sreenireddy · February 8, 2024, 6:24pm

Thanks. Is there a particular library you use for building and formatting the knowledge graphs? That needs to be solved first for us, to create a sort of knowledge base on which to build RAG.

clb · February 9, 2024, 7:04am

We don’t use any framework or lib for the whole architecture (I mean no LLM-specific lib, of course). I think the libs I have seen so far use the wrong kind of abstraction and just make things harder, IMHO. A knowledge graph is just a data structure you can build with low-level tools (in our case Redis & JSON).
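
As a rough illustration of "just a data structure", the sketch below builds a tiny document graph out of plain dictionaries, with Markdown as the node payload, and round-trips it through JSON. The node fields (`kind`, `markdown`, `page`, `children`) and the helper names are assumptions made for this example, not docAnalyzer's schema, and it is in Python rather than JS. The serialized string is the kind of value one could store under a Redis key, but no Redis call is made here.

```python
import json


def make_node(node_id, kind, markdown, page=None):
    """A node is a plain dict: Markdown payload plus child ids."""
    return {"id": node_id, "kind": kind, "markdown": markdown,
            "page": page, "children": []}


def add_child(graph, parent_id, node):
    """Register the node and link it under its parent."""
    graph[node["id"]] = node
    graph[parent_id]["children"].append(node["id"])


def toc(graph, node_id, depth=0):
    """Rebuild a table of contents by walking document and section nodes."""
    node = graph[node_id]
    lines = []
    if node["kind"] in ("document", "section"):
        lines.append("  " * depth + node["markdown"])
    for child_id in node["children"]:
        lines.extend(toc(graph, child_id, depth + 1))
    return lines


graph = {"doc": make_node("doc", "document", "# Annual report")}
add_child(graph, "doc", make_node("s1", "section", "## 1. Revenue", page=3))
add_child(graph, "s1", make_node(
    "t1", "table", "| Year | Revenue |\n|---|---|\n| 2023 | 10 |", page=4))
add_child(graph, "doc", make_node("s2", "section", "## 2. Risks", page=7))

# Plain JSON text: could be stored as a string value in a key-value store.
serialized = json.dumps(graph)
restored = json.loads(serialized)
```

Because the table keeps its Markdown form and its parent section, the TOC and tabular structure mentioned earlier in the thread survive extraction and can be handed back to the LLM as context.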
