Extracting text from PDFs

ai4ki · December 11, 2023, 3:55pm

If I’m not mistaken, the SimpleDirectoryReader in LlamaIndex uses PyPDF2 to extract text from PDF files. While this is a great free, open-source tool, I have found that the quality of the extracted text becomes a performance issue in RAG-based systems when the PDF has a complex layout (like a scientific article with two columns, figures, tables, footers, headers, etc.).

As PyPDF2 is used quite often in this area, I wonder if anyone has studied this systematically? While high accuracy in text extraction might not be critical for some use cases, for scientific purposes like AI-assisted literature reviews (which is what I’m working on) I think it’s of utmost importance. Any ideas on this or references to relevant literature are very welcome!

balaji.ambresh · December 13, 2023, 5:35am

Please see this Coursera course, which covers extracting text from a scientific paper with 2 columns.

OCR requires thinking about how you want to read sentences (1 column at a time) and understanding the document layout to get better results. As you rightly pointed out, interpreting a 2-column paper as 1-column text will yield incorrect results.
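
To make the column-order point concrete, here is a minimal sketch. It assumes you already have text blocks with bounding boxes from some OCR or layout tool; the `(x0, y0, x1, y1, text)` tuple format and the function name `order_two_column` are made up for illustration and do not come from any particular library. It also assumes a strict 2-column page with the origin at the top-left, and it does not handle full-width elements such as titles or wide tables.

```python
from typing import List, Tuple

# Hypothetical block format: (x0, y0, x1, y1, text), origin at top-left,
# y increasing downward. Real tools return their own structures.
Block = Tuple[float, float, float, float, str]


def order_two_column(blocks: List[Block], page_width: float) -> List[str]:
    """Return block texts in reading order for a strict 2-column page."""
    mid = page_width / 2

    def sort_key(block: Block) -> Tuple[int, float]:
        x0, y0, x1, _y1, _text = block
        center_x = (x0 + x1) / 2
        column = 0 if center_x < mid else 1  # 0 = left column, 1 = right column
        return (column, y0)  # whole left column first, then top to bottom

    return [block[4] for block in sorted(blocks, key=sort_key)]


def order_naive(blocks: List[Block]) -> List[str]:
    """What a 1-column reading does: top to bottom across the full page width."""
    return [block[4] for block in sorted(blocks, key=lambda b: (b[1], b[0]))]


blocks = [
    (310, 50, 590, 100, "R1"),
    (20, 200, 290, 250, "L2"),
    (20, 50, 290, 100, "L1"),
    (310, 200, 590, 250, "R2"),
]
print(order_two_column(blocks, page_width=612))  # left column, then right column
print(order_naive(blocks))                       # interleaves the two columns
```

The naive ordering interleaves lines from both columns, which is the kind of scrambled text that ends up in the chunks a RAG system retrieves.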

ai4ki · December 15, 2023, 4:18pm

Thanks! You’re absolutely right, OCR with (py)tesseract is certainly an option for extracting text from PDFs. But I have found it challenging to make it work reliably enough for arbitrary PDFs: selecting and stitching together the extracted text blocks in the correct order is difficult and error-prone.

I guess there are tools out there that perform better for my use case than, say, PyPDF2, but they are either rather difficult to set up (like GROBID) or fee-based (like Adobe PDF Extract API).

The question for me is really: how much do PDF extraction errors matter in the context of (scholarly) RAG applications? Maybe a good enough LLM will smooth out certain types of errors (like a footer inserted into the main text), but I can imagine situations where important information is lost or misrepresented due to such errors.
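
As a rough illustration of the footer case, here is a sketch of one simple heuristic: drop any line that repeats verbatim on most pages. It assumes the text has already been extracted page by page into a list of strings; the function name `strip_repeated_lines` and the `min_fraction` threshold are my own choices for the example, not part of any library. It will miss footers that change from page to page (such as page numbers) and could remove a genuine sentence that happens to repeat.

```python
from collections import Counter
from typing import List


def strip_repeated_lines(pages: List[str], min_fraction: float = 0.6) -> List[str]:
    """Remove lines that occur verbatim on at least min_fraction of the pages."""
    counts: Counter = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})

    # Require at least 2 pages so a single-page document is left untouched.
    threshold = max(2, int(min_fraction * len(pages)))
    repeated = {line for line, n in counts.items() if n >= threshold}

    return [
        "\n".join(line for line in page.splitlines() if line.strip() not in repeated)
        for page in pages
    ]


pages = [
    "Journal of X\nIntro text\n",
    "Journal of X\nMethods text",
    "Journal of X\nResults text",
]
print(strip_repeated_lines(pages))
```

Whether this kind of cleanup measurably changes RAG answers is exactly the open question; the sketch only shows that some error types are cheap to remove before indexing.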

balaji.ambresh · December 15, 2023, 5:16pm

It depends on how good you want the application to be. Trouble starts when a PDF-to-text converter treats a 2-column page as 1 column.
Have you heard of "garbage in, garbage out"?

ai4ki · December 19, 2023, 2:36pm

My use cases for a RAG application are certain scientific tasks, like literature reviews. And, yes, as with any scientific task, better data yields better answers. So, ideally, what I want is 100% accuracy in text extraction from PDFs, knowing full well it isn’t achievable.
