PDF with tabular data

I am trying to build a RAG-based Q&A product. The data is in PDF files, but the files contain a mix of text, tables (with numbers), and some graphs.
After many attempts, it seems RAG is not able to retrieve accurate information. Can anyone suggest alternative models to try, or ways to improve the answers?


Have you tried a hybrid model approach? For instance, use a text-based model for narrative content and a different model trained on tabular data for queries related to tables. Fine-tuning on your specific data could help a lot. My humble opinion only. Cheers


Thank you @chuaal. I thought about that too, but how would I separate the segments (text, table, figure/chart) without losing context? The tables, and the figures in those tables, only make sense when read in the context of the text before and after them.


If your inputs have tables and graphs, you will need additional models to augment the pipeline. You can try a few things:

  1. Use an object detection model, e.g. YOLO, to identify tables and graphs first, so you can extract them and process them separately. How you process the tables and graphs depends on what information you want to retain as input to the LLM.

  2. Or you can try converting tables to text. There are models trained for this purpose (RUCAIBox/mtl-data-to-text · Hugging Face).

  3. After that, you can consider the “Recursive Retriever + Query Engine” method. See this demo: Recursive Retriever + Query Engine Demo - LlamaIndex 🦙 0.9.34
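
To make the idea of processing tables separately without losing context more concrete, here is a minimal sketch in plain Python. It is not the mtl-data-to-text model and not the LlamaIndex API. It assumes the page has already been split into text and table elements in reading order, for example by a detector as in step 1. The names `Element`, `linearize_table` and `build_chunks` are hypothetical, and the template-based linearization is only a stand-in for a trained table-to-text model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """One page element in reading order (hypothetical structure)."""
    kind: str                                # "text" or "table"
    text: str = ""                           # paragraph text when kind == "text"
    rows: Optional[List[List[str]]] = None   # header row + data rows when kind == "table"


def linearize_table(rows: List[List[str]]) -> str:
    """Turn a table into one 'header: value' line per data row."""
    header, *body = rows
    lines = []
    for row in body:
        cells = [f"{h}: {v}" for h, v in zip(header, row)]
        lines.append("; ".join(cells))
    return "\n".join(lines)


def build_chunks(elements: List[Element]) -> List[str]:
    """Build retrieval chunks; each table chunk carries its neighbouring text."""
    chunks = []
    for i, el in enumerate(elements):
        if el.kind == "text":
            chunks.append(el.text)
        elif el.kind == "table":
            # Attach the paragraph directly before and after the table, if any,
            # so the numbers are indexed together with the text that explains them.
            before = elements[i - 1].text if i > 0 and elements[i - 1].kind == "text" else ""
            after = (
                elements[i + 1].text
                if i + 1 < len(elements) and elements[i + 1].kind == "text"
                else ""
            )
            parts = [p for p in (before, linearize_table(el.rows), after) if p]
            chunks.append("\n".join(parts))
    return chunks


if __name__ == "__main__":
    page = [
        Element("text", "Revenue by year is shown below."),
        Element("table", rows=[["Year", "Revenue"], ["2022", "10"], ["2023", "12"]]),
        Element("text", "Revenue grew 20%."),
    ]
    for chunk in build_chunks(page):
        print(chunk)
        print("---")
```

The table chunk duplicates the surrounding paragraphs on purpose: it costs some index space, but a query about the numbers can then match the explanatory text as well.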

Hope this is helpful.


Thank you @YANG_FAN, this is really helpful. Starting with mtl data to text, keeping fingers crossed.


I’m having the same issue extracting information from PDF files with a mix of text and tables. I’m using Azure Document Intelligence to extract that info. It works fine for text, scanned text and tables.
The output is a JSON file with the text, among other things, along with the geometric coordinates of the text, so you can tell whether a piece of text is close to a table.
If you can, give it a try.
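
Here is a rough sketch of how the coordinates could be used to decide whether a paragraph sits next to a table. It does not parse the actual Azure Document Intelligence JSON. It assumes you have already reduced each region to an axis-aligned box `(x0, y0, x1, y1)` in the same units. The function names and the `max_gap` threshold are hypothetical.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), with x0 <= x1 and y0 <= y1


def box_gap(a: Box, b: Box) -> float:
    """Shortest distance between two axis-aligned boxes (0.0 if they touch or overlap)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    dx = max(bx0 - ax1, ax0 - bx1, 0.0)  # horizontal gap, 0 if the x-ranges overlap
    dy = max(by0 - ay1, ay0 - by1, 0.0)  # vertical gap, 0 if the y-ranges overlap
    return math.hypot(dx, dy)


def texts_near_table(
    paragraphs: List[Tuple[str, Box]], table_box: Box, max_gap: float
) -> List[str]:
    """Return the paragraphs whose box lies within max_gap of the table box."""
    return [text for text, box in paragraphs if box_gap(box, table_box) <= max_gap]


if __name__ == "__main__":
    table = (0.0, 3.0, 10.0, 8.0)
    paragraphs = [
        ("Caption just above the table", (0.0, 0.0, 10.0, 2.0)),
        ("Unrelated paragraph far below", (0.0, 20.0, 10.0, 22.0)),
    ]
    print(texts_near_table(paragraphs, table, max_gap=1.5))
```

The threshold depends on the page units and layout, so it is worth checking it against a few real pages before relying on it.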
