PDF parsing

Hello, I love learning and using LlamaIndex. I have a couple of questions about processing PDFs: there is a PDFReader class, SimpleDirectoryReader, and LlamaParse. Which one should I use for which use case? Any recommendations would be appreciated.

From what I can see, PDFReader uses the PyPDF library, and LlamaParse uses image capture plus an LLM to do OCR. I couldn't find much information about SimpleDirectoryReader.


Hello @PremSea,

For processing PDFs with LlamaIndex, use the PDFReader class if your PDFs are primarily text-based (it extracts the embedded text with PyPDF). If your PDFs contain scanned images or non-selectable text, LlamaParse is the better option, since it combines image capture and OCR with an LLM for more accurate extraction. SimpleDirectoryReader is designed for reading multiple files from a directory, which makes it useful for batch processing of documents; it picks a reader per file based on the extension, and for .pdf files it uses PDFReader by default.
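
As a rough sketch of that rule of thumb (not official guidance): `pick_loader`, `load`, and the `scanned` flag are made-up names for illustration, nothing here inspects the PDF contents, and the lazy imports assume the llama-index-readers-file and llama-parse packages are installed, with a LlamaCloud API key configured for LlamaParse.

```python
from pathlib import Path


def pick_loader(path: str, scanned: bool = False) -> str:
    """Return the name of the loader suggested above for `path`.

    `scanned` is a hypothetical flag the caller sets when the PDF has no
    selectable text; this function does not look inside the file.
    """
    if Path(path).suffix.lower() != ".pdf":
        # Not a single PDF: treat it as a directory of mixed files.
        return "SimpleDirectoryReader"
    if scanned:
        return "LlamaParse"
    return "PDFReader"


def load(path: str, scanned: bool = False):
    """Load documents with the suggested loader (imports are lazy)."""
    choice = pick_loader(path, scanned)
    if choice == "LlamaParse":
        # Assumes llama-parse is installed and LLAMA_CLOUD_API_KEY is set.
        from llama_parse import LlamaParse
        return LlamaParse(result_type="text").load_data(path)
    if choice == "PDFReader":
        # Assumes llama-index-readers-file is installed.
        from llama_index.readers.file import PDFReader
        return PDFReader().load_data(file=Path(path))
    from llama_index.core import SimpleDirectoryReader
    return SimpleDirectoryReader(input_dir=path).load_data()
```

For example, `load("report.pdf")` would go through PDFReader, `load("scan.pdf", scanned=True)` through LlamaParse, and `load("docs/")` through SimpleDirectoryReader.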

Hope this helps!