PDF with tabular data

Hi,
I am trying to build a RAG-based Q&A product. The data is in PDF files, which contain a mix of text, tables (with numbers), and some graphs.
After many attempts, it seems RAG is not able to retrieve accurate information. Can anyone suggest alternative models to try or ways to improve the answers?


Have you tried a hybrid model approach? For instance, use a text-based model for narrative content and a different model trained on tabular data for queries related to tables. Fine-tuning on your specific data could help a lot. My humble opinion only. Cheers


Thank you @chuaal. I thought about that too, but how would I separate the segments (text, table, figure/chart) without losing context? The tables and the figures in them only make sense when read in the context of the text before and after them.


If your inputs have tables and graphs, you will need to use other models to augment. You can try a few things:

  1. Use an object detection model, e.g. YOLO, to identify tables and graphs first, so you can extract them and process them separately. How you process the tables and graphs depends on what information you want to retain as input to the LLM.

  2. Or you can convert the tables to text. There are models trained for this purpose (RUCAIBox/mtl-data-to-text · Hugging Face).

  3. After that, you can consider the "Recursive Retriever + Query Engine" method. You can refer to this demo: Recursive Retriever + Query Engine Demo - LlamaIndex 🦙 0.9.34
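To make the table-to-text idea in step 2 a bit more concrete, here is a minimal sketch in plain Python. It is a simple rule-based stand-in, not the mtl-data-to-text model, and it assumes the table has already been extracted into a header and rows. The function names (`table_to_sentences`, `build_table_chunk`) and the sample data are hypothetical. The point is that the paragraphs before and after the table stay in the same chunk, so the numbers keep their context.

```python
from typing import List


def table_to_sentences(header: List[str], rows: List[List[str]]) -> List[str]:
    """Turn each table row into one 'column is value' sentence."""
    sentences = []
    for row in rows:
        pairs = [f"{col} is {val}" for col, val in zip(header, row)]
        sentences.append("; ".join(pairs) + ".")
    return sentences


def build_table_chunk(before: str, header: List[str],
                      rows: List[List[str]], after: str) -> str:
    """Keep the text before and after the table in the same chunk."""
    parts = [before.strip(), *table_to_sentences(header, rows), after.strip()]
    return "\n".join(p for p in parts if p)


# Hypothetical example data, not taken from any real document.
chunk = build_table_chunk(
    "Table 2 shows revenue by region.",
    ["Region", "Revenue (USD m)"],
    [["North", "12.5"], ["South", "8.1"]],
    "Revenue in the North grew fastest.",
)
print(chunk)
```

Each chunk built this way can then be embedded and indexed like any other text chunk. A trained data-to-text model would likely produce more natural sentences, but the idea of keeping the surrounding text attached is the same.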

Hope this is helpful.


Thank you @YANG_FAN, this is really helpful. I am starting with mtl-data-to-text and keeping my fingers crossed.


Hi.
I'm having the same issue extracting information from PDF files with a mix of text and tables. I'm using Azure Document Intelligence to extract that info. It works fine for text, scanned text and tables.
The output is a JSON file with the text and other things, including the geometric coordinates of the text, so you can tell if a piece of text is close to a table.
If you can, give it a try.
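As a rough illustration of the "is this text close to a table" check, here is a minimal sketch. It assumes the coordinates have already been read from the JSON into simple boxes. The `Box` class, its fields, the units, and the `max_gap` threshold are all hypothetical and do not reflect the actual Document Intelligence output schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Hypothetical bounding box: page number plus top-left/bottom-right corners."""
    page: int
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""


def vertical_gap(a: Box, b: Box) -> float:
    """Vertical distance between two boxes; 0.0 if they overlap vertically."""
    if a.y1 < b.y0:
        return b.y0 - a.y1
    if b.y1 < a.y0:
        return a.y0 - b.y1
    return 0.0


def text_near_table(table: Box, paragraphs: List[Box],
                    max_gap: float = 0.5) -> List[str]:
    """Return paragraph texts on the table's page within max_gap of it, top to bottom."""
    near = [p for p in paragraphs
            if p.page == table.page and vertical_gap(p, table) <= max_gap]
    near.sort(key=lambda p: p.y0)
    return [p.text for p in near]


# Hypothetical layout (units could be inches): one table and four paragraphs.
table = Box(page=1, x0=1.0, y0=4.0, x1=7.5, y1=6.0)
paragraphs = [
    Box(1, 1.0, 0.5, 7.5, 1.0, "Introduction"),
    Box(1, 1.0, 3.2, 7.5, 3.8, "Table 1 summarises quarterly sales."),
    Box(1, 1.0, 6.3, 7.5, 6.9, "Sales peaked in Q3."),
    Box(2, 1.0, 4.1, 7.5, 4.5, "Text on another page."),
]
print(text_near_table(table, paragraphs))
```

The paragraphs returned this way can be stored together with the table when chunking, so the table is retrieved with its surrounding explanation.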


Hi haroldc!

May I ask you a question on document intelligence:

I am building a chatbot in Azure OpenAI Studio. My documents contain many tables. I want to use the prebuilt layout model from Document Intelligence to improve the results.

You write that you use the document intelligence studio.

My question: Can you please tell me how you "add" the documents you have processed/analyzed into your model? You write that you use the JSON file - can you tell me more about it? I hope my question is clear. Thank you

Hi. It's a long trip from extracting info from documents to using it in a chatbot model, but basically I use the standard RAG strategy: get the data (text), chunk it, vectorize it (the embedding process) and index it. Search for any RAG guide, there are plenty out there, even on YouTube.
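For anyone who wants to see those steps in code, here is a minimal sketch using only the Python standard library. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and the in-memory list is a stand-in for a real vector index. All function names here are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def chunk_text(text: str, size: int = 40, overlap: int = 10) -> List[str]:
    """Split text into chunks of `size` words, overlapping by `overlap` words."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_index(chunks: List[str]) -> List[Tuple[str, Counter]]:
    """Index step: store each chunk next to its vector."""
    return [(c, embed(c)) for c in chunks]


def search(index: List[Tuple[str, Counter]], query: str, k: int = 2) -> List[str]:
    """Retrieve the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Hypothetical chunks standing in for text extracted from a PDF.
chunks = [
    "Revenue in the North region was 12.5 million.",
    "The company was founded in 1990.",
]
index = build_index(chunks)
print(search(index, "What was the revenue in the North?", k=1))
```

The retrieved chunks are then passed to the chat model as context together with the user's question. Swapping the toy `embed` for a real embedding model and the list for a vector store leaves the overall flow unchanged.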

Thank you, haraldc for your quick answer. I wish you best of luck and thank you for your suggestion to use RAG. Best, Katarina


Just wanted to share with everyone that I came across a parsing service launched by LlamaIndex (LlamaCloud). It's not perfect, but it works better than pypdf.