PDF parsing

PremSea · May 17, 2024, 10:43pm

Hello, I love learning and using LlamaIndex. A couple of questions about processing PDFs: there is a PDFReader class, SimpleDirectoryReader, and LlamaParse. Which one should I use for which use case? Any recommendations you can share would be appreciated.

From what I can see, PDFReader uses the PyPDF library, and LlamaParse uses image capture plus an LLM to do OCR. I couldn't find much information about SimpleDirectoryReader.

Thanks!

Alireza_Saei · May 18, 2024, 5:57am

Hello @PremSea,

For processing PDFs with LlamaIndex, use the PDFReader class if your PDFs are primarily text-based (it uses PyPDF). If your PDFs contain scanned images or non-selectable text, LlamaParse is the better option, as it combines image capture and OCR with an LLM for more accurate extraction. SimpleDirectoryReader is designed for reading multiple files from a directory, which makes it useful for batch processing of documents.
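As a rough sketch of the rule of thumb above, something like the following could route a path to one of the three loaders. The helper names `pick_loader` and `load` and the `has_text_layer` flag are hypothetical, made up for illustration. The import paths assume a recent LlamaIndex (0.10 or later) with the `llama-index-readers-file` and `llama-parse` packages installed, and LlamaParse additionally needs an API key (here assumed to be set in the `LLAMA_CLOUD_API_KEY` environment variable).

```python
from pathlib import Path


def pick_loader(path: str, has_text_layer: bool = True) -> str:
    """Return the name of the loader suggested for `path`.

    `has_text_layer` is a hypothetical flag supplied by the caller:
    True for PDFs with selectable text, False for scanned PDFs.
    """
    if Path(path).suffix.lower() != ".pdf":
        # Simplification: anything that is not a single .pdf path
        # is treated as a directory of documents.
        return "SimpleDirectoryReader"
    return "PDFReader" if has_text_layer else "LlamaParse"


def load(path: str, has_text_layer: bool = True):
    """Load documents with the suggested loader.

    Imports are done lazily so that only the package for the chosen
    loader has to be installed.
    """
    choice = pick_loader(path, has_text_layer)
    if choice == "PDFReader":
        # Assumed import path for LlamaIndex >= 0.10.
        from llama_index.readers.file import PDFReader
        return PDFReader().load_data(file=Path(path))
    if choice == "LlamaParse":
        # Assumes LLAMA_CLOUD_API_KEY is set in the environment.
        from llama_parse import LlamaParse
        return LlamaParse(result_type="markdown").load_data(path)
    from llama_index.core import SimpleDirectoryReader
    return SimpleDirectoryReader(input_dir=path).load_data()
```

For example, `load("report.pdf")` would go through PDFReader, `load("scan.pdf", has_text_layer=False)` through LlamaParse, and `load("docs/")` through SimpleDirectoryReader. As far as I know, SimpleDirectoryReader chooses a reader by file extension and falls back to PDFReader for `.pdf` files, so for a folder of text-based PDFs the first and last options should give similar results.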

Hope this helps!
