Seeking Advice: Integrating LLM with Large Local Document Databases

Hey everyone,

I’m working on a project where I’m planning to use an LLM to answer questions based on a bunch of local documents (thousands) we have. These documents are all in PDF format, and we’re expecting to update them daily. The idea is to let the LLM dig through these documents to provide natural-language answers to user queries.

I’ve been chatting with some developers, and a couple of questions came up:

  1. How can we set up the LLM to handle thousands of these local PDFs efficiently? We’ve heard that just embedding them into a vector database might not cut it for high-quality searches given the massive amount of data. Finetuning the model constantly isn’t really feasible since we’re updating the documents so often.

  2. Is there a good way to get the LLM to understand the charts and tables in the PDFs (I’m not talking about charts without text descriptions, like simple bar or bubble charts)?

Thanks a lot for any advice or insights you can share!

You shouldn’t feed PDFs directly into the pipeline. You’ll need to extract the text from the documents first; tabula-py and pdfplumber come to mind, though there may be better tools nowadays. There are options for dealing with tables, but they don’t offer anything for charts. The quality of the data will depend on the origin of the PDF: those generated from text documents will work well, but those with complex structures will pose some challenges. My advice is to spend a good amount of time getting quality data, which will translate into better results. Once you extract the text, I reckon the database will be just a few megabytes. A few thousand documents is not a massive dataset for LLMs.
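To make the extraction step concrete, here is a minimal sketch using pdfplumber (the file name and output structure are placeholders, not a prescription):

```python
import pdfplumber

def extract_pdf(path: str) -> dict:
    """Extract plain text and tables from a PDF, page by page."""
    pages = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            pages.append({
                "page_number": page.page_number,
                "text": page.extract_text() or "",
                # Each table is a list of rows; each row a list of cell strings.
                "tables": page.extract_tables(),
            })
    return {"source": path, "pages": pages}
```

tabula-py works in a similar way for tables, but note that it needs a Java runtime underneath.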

Depending on the size of your dataset, the speed at which your document set changes, the importance of being up to date, and your resources (it will cost some money), you could decide on a cadence for fine-tuning. It would be good to explore some online fine-tuning and determine whether that works for your project.

We’re happy to hear more about your project and give you more advice if you want.


@andres.castillo covered many good points already. It’s definitely better to pull the text out of the PDFs and work with that instead.

I think you’ll need to do some experiments with your specific data to see whether vector database/RAG or fine-tuning is the better choice. I think it’s hard to tell unless you try it out.

In my personal opinion, it’s usually quicker/easier to use a vector database/RAG with some API than to fine-tune. I’d personally try that out first.
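To give a feel for it, a bare-bones RAG loop is only a few lines with Chroma plus the OpenAI API. A minimal sketch (the model name, IDs, and sample chunks are placeholder assumptions; Chroma embeds with its default local model here):

```python
import chromadb
from openai import OpenAI

chroma = chromadb.Client()
collection = chroma.create_collection("pdf_chunks")

# In practice these would be chunks extracted from your PDFs.
collection.add(
    ids=["doc1-p1", "doc1-p2"],
    documents=["First chunk of extracted text...", "Second chunk..."],
    metadatas=[{"source": "doc1.pdf", "page": 1}, {"source": "doc1.pdf", "page": 2}],
)

def answer(question: str) -> str:
    # Retrieve the closest chunks, then let the LLM answer from them.
    hits = collection.query(query_texts=[question], n_results=3)
    context = "\n\n".join(hits["documents"][0])
    llm = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = llm.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat model works here
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```

Swapping the vector store or the model later doesn’t change the shape of this loop, which is what makes it quick to iterate on.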

With regards to understanding charts and tables in PDFs, there are some vision models that do just that (ex. MatCha). They could be a little more difficult to set up, but if you really need it, then you don’t have a choice.
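If it helps, MatCha is usable through Hugging Face transformers. A rough sketch of asking a question about a chart image (the image path and question are placeholders; google/matcha-chartqa is the public chart-QA checkpoint):

```python
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

processor = Pix2StructProcessor.from_pretrained("google/matcha-chartqa")
model = Pix2StructForConditionalGeneration.from_pretrained("google/matcha-chartqa")

# A chart rendered or cropped out of a PDF page.
image = Image.open("chart.png")
inputs = processor(images=image, text="What is the highest value?", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(out[0], skip_special_tokens=True))
```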

If you’re not worried about uploading your data to OpenAI, I think their API should also have support for PDFs and processing the charts/tables in them (you should just be able to upload the file and ChatGPT will analyze it for you). Obviously, it costs more and there’s the data privacy concern, but it’d likely be quicker to prototype with their API if that’s not a blocker for you.
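If you go that route, the upload step itself is tiny with the OpenAI Python SDK; a sketch (what happens to the file afterwards depends on which tooling, e.g. an Assistant, you attach it to):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a PDF so it can be attached to Assistants-style tooling.
uploaded = client.files.create(
    file=open("report.pdf", "rb"),
    purpose="assistants",
)
print(uploaded.id)
```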

Hey Andres @andres.castillo,

Thanks a ton for breaking things down for me. If I got it right, you’re suggesting a workflow that looks something like this:

  • Start with solid libraries to pull info from PDFs (noticed PyMuPDF gets thumbs up too).
  • Move on to embeddings and tuck that data into a vector database (see the chunking sketch after this list).
  • Dive into some smart search techniques within that database.
  • And finally, bring in an LLM to dish out answers to users.
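To check my own understanding of the embedding step, I’d split the extracted text into overlapping chunks before embedding; something like this sketch (the size and overlap values are just guesses on my part):

```python
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split extracted text into overlapping chunks for embedding.

    The overlap keeps sentences that straddle a boundary retrievable
    from at least one chunk.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks
```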

I’ve got a few more questions bubbling up about what you shared:

  • I’ve seen some chatter about folks using GPT Vision to chew through PDFs by first converting them into images and then letting GPT Vision do its thing on those pics (roughly the route sketched after this list). Seems like a bit of a roundabout to me. Have you ever taken this route?
  • Does picking one vector database over another have an impact on the quality and speed of search? Chroma, Pinecone, and Qdrant seem to be hitting the charts.
  • And on search strategies, I’m eyeballing things like Atlas Vector Search and Azure AI Search. Got any favorites or tips?
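For reference, the roundabout GPT Vision route I’ve seen described looks roughly like this (a sketch assuming PyMuPDF for rendering and a vision-capable OpenAI model; the file name, model, and prompt are placeholders):

```python
import base64
import fitz  # PyMuPDF
from openai import OpenAI

# Render the first PDF page to a PNG image.
doc = fitz.open("report.pdf")
png_bytes = doc[0].get_pixmap(dpi=150).tobytes("png")
b64 = base64.b64encode(png_bytes).decode()

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",  # assumption: any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the chart on this page."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```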
Really appreciate you lending me your brain on this!

Hey Hackyon @hackyon,

Really appreciate your input and adding onto what Andres shared.

So, diving a bit deeper into the PDF extraction saga, my priority ranking for the data to pull out goes like this: text first, then tables, and charts last. Mainly because charts in PDFs often lack text info and are just visual.

In my previous message to Andres, I mentioned that some folks recommend GPT-Vision for dealing with PDFs. The catch is converting PDFs into images first. I’m on the fence about whether this is more efficient than sticking with libraries like PyMuPDF or tabula-py.

When it comes to uploading docs to OpenAI, I’m not sweating the data privacy much. But I stumbled upon some chatter online about there being a cap of 20 files per Assistant, maxing out at 512 MB each. I’m scratching my head over whether this restriction is just for custom GPTs or if it spills over to the ChatGPT API too. Got any insights on this?

Thanks a bunch for helping me navigate through this!

The answer will depend on your setup. Is your model local, or are you using it through a web API?
Work on some prototypes first: start with the simplest scenarios and datasets that fit into memory, and worry about ops and performance later. However, even at the prototyping stage it is important to define some way to evaluate that your model is working as expected. Work on some automated tests.
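For example, even a tiny retrieval smoke test gives you a safety net; a sketch with pytest and Chroma (the documents, questions, and file names are made-up examples):

```python
import chromadb
import pytest

@pytest.fixture
def collection():
    # A toy collection standing in for your real extracted chunks.
    client = chromadb.Client()
    col = client.create_collection("test_chunks")
    col.add(
        ids=["a", "b"],
        documents=["2023 revenue was 4.2M.", "The project sponsor is J. Doe."],
        metadatas=[{"source": "annual_report_2023.pdf"}, {"source": "charter.pdf"}],
    )
    return col

def test_retrieval_hits_expected_source(collection):
    cases = [
        ("What was the revenue in 2023?", "annual_report_2023.pdf"),
        ("Who is the project sponsor?", "charter.pdf"),
    ]
    for question, expected in cases:
        hits = collection.query(query_texts=[question], n_results=1)
        assert hits["metadatas"][0][0]["source"] == expected
```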

The selection of your tools will have an impact on performance for sure, but there are other variables to consider: simplicity, availability, and cost. Some tools will come in handy for prototyping and others will work for production environments. But first, get your prototype working.

@andres.castillo Hi Andres, thank you for your valuable advice.
Yes, I will start by building a simple initial prototype. In response to your question, my intention is to use the ChatGPT API initially.

Hi, PyMuPDF doesn’t seem to be very accurate when it comes to extracting tables from PDFs. Are there any better libraries that can do this?