PDF with tabular data

Ather · January 19, 2024, 7:55pm

Hi,
I am trying to build a RAG-based Q&A product. The data is in PDF files, which contain a mix of text, tables (with numbers), and some graphs.
After many attempts, it seems RAG is not able to retrieve accurate information. Can anyone suggest alternative models to try, or ways to improve the answers?

chuaal · January 19, 2024, 8:41pm

Have you tried a hybrid model approach? For instance, use a text-based model for narrative content and a different model, trained on tabular data, for queries related to tables. Fine-tuning on your specific data could help a lot. My humble opinion only. Cheers

Ather · January 20, 2024, 3:07am

Thank you @chuaal, I thought about that too but how would I separate the segments (text, table, figure/chart) without losing context? Tables and figures in those tables only make sense when looked at in the context of text before and after the tables.

YANG_FAN · January 20, 2024, 10:40am

If your inputs have tables and graphs, you will need to use other models to augment. You can try a few things:

1. Use an object detection model, e.g. YOLO, to identify tables and graphs first, so you can extract them and process them separately. How you process the tables and graphs depends on what information you want to retain as input to the LLM.
2. Or you can try converting tables to text. There are models trained for this purpose (RUCAIBox/mtl-data-to-text on Hugging Face).
3. After that, you can consider the “Recursive Retriever + Query Engine” method. You can refer to this demo: Recursive Retriever + Query Engine Demo (LlamaIndex 0.9.34 documentation).
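
To illustrate the "tables to text" idea, here is a minimal sketch. It is a plain rule-based linearization, not the RUCAIBox model mentioned above, and the function names (`table_to_text`, `table_chunk`) are made up for this example. It assumes the table has already been extracted into a header and rows, and it keeps the text before and after the table in the same chunk so the numbers do not lose their context.

```python
def table_to_text(header, rows, caption=""):
    """Turn each table row into one 'column: value' line.

    Rule-based stand-in for a learned data-to-text model.
    """
    prefix = f"{caption} | " if caption else ""
    lines = []
    for row in rows:
        cells = [f"{col}: {val}" for col, val in zip(header, row)]
        lines.append(prefix + "; ".join(cells))
    return lines


def table_chunk(before, header, rows, after, caption=""):
    """Build one retrievable chunk: text before + linearized table + text after."""
    parts = [before.strip(), *table_to_text(header, rows, caption), after.strip()]
    # Drop empty parts so missing context does not leave blank lines.
    return "\n".join(p for p in parts if p)


if __name__ == "__main__":
    header = ["Year", "Revenue"]
    rows = [["2022", "10"], ["2023", "12"]]
    print(table_chunk("Revenue by year (in millions):", header, rows, "", "Table 1"))
```

Each row becomes a self-describing line, so a retrieved chunk still says which column a number belongs to even if the row is read in isolation.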

Hope this is helpful.

Ather · January 21, 2024, 5:11am

Thank you @YANG_FAN, this is really helpful. I'm starting with mtl-data-to-text, keeping my fingers crossed.

haroldc · February 3, 2024, 7:52pm

Hi.
I’m having the same issue extracting information from PDF files with a mix of text and tables. I’m using Azure Document Intelligence to extract that info. It works fine for text, scanned text and tables.
The output is a JSON file with the text and other things, including the geometric coordinates of the text, so you can tell whether a piece of text is close to a table.
If you can, give it a try.
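
A rough sketch of the "is this text close to a table" check using coordinates. The `Box` type, the `(x0, y0, x1, y1)` layout, and the `max_gap` threshold are assumptions made for this example; the real Document Intelligence JSON has its own structure, which you would need to map into boxes like these first.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box; y grows downward (hypothetical layout)."""
    x0: float
    y0: float
    x1: float
    y1: float


def vertical_gap(a, b):
    """Vertical distance between two boxes, 0.0 if they overlap vertically."""
    if a.y1 < b.y0:
        return b.y0 - a.y1
    if b.y1 < a.y0:
        return a.y0 - b.y1
    return 0.0


def horizontally_overlaps(a, b):
    """True if the boxes share some horizontal extent (same column)."""
    return a.x0 < b.x1 and b.x0 < a.x1


def text_near_table(paragraphs, table_box, max_gap=0.5):
    """Return the texts of (text, Box) pairs that sit just above or below the table."""
    return [
        text
        for text, box in paragraphs
        if horizontally_overlaps(box, table_box)
        and vertical_gap(box, table_box) <= max_gap
    ]


if __name__ == "__main__":
    table = Box(1.0, 4.0, 7.0, 6.0)
    paragraphs = [
        ("Revenue by year is shown below.", Box(1.0, 3.5, 7.0, 3.8)),
        ("Unrelated footer", Box(1.0, 10.0, 7.0, 10.3)),
    ]
    print(text_near_table(paragraphs, table))
```

The paragraphs found this way can be stored in the same chunk as the table so the retriever sees them together.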

Katarina1 · March 7, 2024, 8:23pm

Hi haroldc!

May I ask you a question on document intelligence:

I am building a chatbot in Azure OpenAI Studio. My documents contain many tables. I want to use the prebuilt layout model from Document Intelligence to improve the results.

You write that you use Document Intelligence.

My question: can you please tell me how you "add" the documents you have processed/analyzed to your model? You write that you use the JSON file. Can you tell me more about that? I hope my question is clear. Thank you

haroldc · March 9, 2024, 4:07pm

Hi. It’s a long trip from extracting info from documents to using it in a chatbot model, but basically I use the standard RAG strategy: get the data (text), chunk it, vectorize it (the embedding process), and index it. Search for any RAG guide; there are plenty out there, even on YouTube.
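
The chunk, embed, index, and search steps can be sketched in a few lines. This is a toy version using only the standard library: the "embedding" is just a word-count vector standing in for a real embedding model, and the "index" is a plain list standing in for a vector store. All function names here are made up for illustration.

```python
import math
import re
from collections import Counter


def chunk(text, size=40, overlap=10):
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text):
    """Toy embedding: lowercase token counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(chunks):
    """Index = list of (chunk text, vector) pairs."""
    return [(c, embed(c)) for c in chunks]


def search(index, query, k=3):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]


if __name__ == "__main__":
    index = build_index(["Revenue grew 12 percent in 2023", "The office moved to Berlin"])
    print(search(index, "revenue in 2023", k=1))
```

In a real pipeline the retrieved chunks would then be passed to the LLM as context for the answer.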

Katarina1 · March 9, 2024, 7:23pm

Thank you, haroldc, for your quick answer. I wish you the best of luck, and thank you for your suggestion to use RAG. Best, Katarina

Ather · March 22, 2024, 6:42am

Just wanted to share with everyone that I came across a parsing service launched by LlamaIndex (LlamaCloud). It's not perfect, but it works better than pypdf.
