If I’m not mistaken, the SimpleDirectoryReader in Llama_index uses PyPDF2 to extract text from PDF files. While this is a great free open source tool, I have found that the quality of the extracted text becomes a performance issue in RAG-based systems when the PDF has a rather complex format (like a scientific article with a two-column layout, figures, tables, footers, headers, etc.).
As PyPDF is used quite often in this area, I wonder if anyone has studied this systematically? While for some use cases high accuracy in text extraction might not be too critical, for scientific purposes like AI-assisted literature reviews (which is what I’m working on) I think it’s of utmost importance. Any ideas on this or references to relevant literature are very welcome!
Please see this Coursera course, which covers extracting text from a two-column scientific paper.
OCR requires thinking about how you want to read sentences (one column at a time) and understanding the document layout to get good results. As you rightly pointed out, interpreting a two-column paper as single-column text will yield incorrect results.
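To make the failure mode concrete, here is a toy, library-free sketch of why reading a two-column page straight across garbles sentences. The "page" is just two synthetic lists of line fragments, not the output of any real extractor:

```python
# Toy two-column page: each list holds the line fragments of one column.
left_col = ["The quick brown", "fox jumps over", "the lazy dog."]
right_col = ["Pack my box", "with five dozen", "liquor jugs."]

# Naive extraction: read each visual row straight across both columns,
# which interleaves fragments of two unrelated sentences.
naive = " ".join(f"{l} {r}" for l, r in zip(left_col, right_col))

# Layout-aware extraction: finish the left column before starting the right.
layout_aware = " ".join(left_col + right_col)

print(naive)         # sentences from both columns are interleaved
print(layout_aware)  # each sentence survives intact
```

The naive version produces fragments like "The quick brown Pack my box", which is exactly the kind of text that poisons retrieval in a RAG pipeline.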
Thanks! You’re absolutely right, OCR with (py)tesseract is certainly an option for extracting text from PDFs. But I have found it challenging to make it work reliably for arbitrary PDFs: selecting and stitching together the extracted text blocks in the correct order is difficult and error-prone.
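The stitching problem can be sketched in a few lines. Layout-aware tools report text blocks with bounding boxes; one common heuristic is to assign each block to a column by its x-position and then sort column-first. The blocks, coordinates, and fixed midline split below are synthetic assumptions for illustration, not the output of any particular tool:

```python
PAGE_WIDTH = 600  # assumed page width in points

# Synthetic blocks as a bbox-reporting extractor might yield them,
# deliberately listed out of reading order.
blocks = [
    {"x0": 320, "top": 50,  "text": "right column, first"},
    {"x0": 40,  "top": 300, "text": "left column, second"},
    {"x0": 40,  "top": 60,  "text": "left column, first"},
    {"x0": 320, "top": 310, "text": "right column, second"},
]

def reading_order(blocks, page_width=PAGE_WIDTH):
    """Sort blocks column-first: everything left of the midline,
    top to bottom, then everything right of it."""
    midline = page_width / 2
    return sorted(blocks, key=lambda b: (b["x0"] >= midline, b["top"]))

ordered = [b["text"] for b in reading_order(blocks)]
print(ordered)
```

And this sketch shows exactly why it is error-prone: a block that straddles the midline, a full-width abstract or figure caption, or a rotated header all break the simple column assignment.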
I suspect there are tools out there that perform better for my use case than, say, PyPDF2, but they are either rather difficult to set up (like GROBID) or fee-based (like the Adobe PDF Extract API).
The question for me is really: how much do PDF extraction errors matter in the context of (scholarly) RAG applications? Maybe a good enough LLM will smooth out certain types of errors (like a footer inserted into the main text), but I can imagine situations where important information is lost or misrepresented due to such errors.
It depends on how good you want the application to be. Trouble starts when a PDF-to-text converter treats a two-column page as one column.
Have you heard of "garbage in, garbage out"?
My use cases for a RAG application are certain scientific tasks, like literature reviews. And yes, as with any scientific task, better data yields better answers. So, ideally, what I want is 100% accuracy in text extraction from PDFs, knowing full well it isn’t achievable.