I’m building a RAG-based document QA system using Python (no LangChain), LLaMA (50K context), PostgreSQL with pgvector, and Docling for parsing. Users can upload up to 10 large documents (300+ pages each), often containing numerous tables and charts.
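For context, here’s roughly what my retrieval path looks like. This is a simplified sketch, not my exact code: the `chunks` table, column names, and the `embed()` helper are illustrative placeholders.

```python
import psycopg2


def embed(text: str) -> list[float]:
    # Placeholder for the embedding model call (same model used at ingest).
    raise NotImplementedError


def knn_search(conn, query: str, k: int = 20):
    # Flat cosine KNN over every chunk in the corpus. At 30K+ rows the
    # top-k starts pulling in loosely related chunks from other docs.
    vec = "[" + ",".join(str(x) for x in embed(query)) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, doc_id, content, embedding <=> %s::vector AS dist
            FROM chunks
            ORDER BY dist
            LIMIT %s
            """,
            (vec, k),
        )
        return cur.fetchall()


# Example usage:
# conn = psycopg2.connect("dbname=ragdb")
# hits = knn_search(conn, "What was FY23 revenue in the EMEA segment?")
```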
I’m facing a few specific challenges:
- 30K+ total chunks across all docs → KNN retrieval gets noisy.
- Tried LLM-based reranking, but it’s too slow and expensive to run over all 30K chunks (see the sketch after this list).
- Tried summarizing each chunk to improve retrieval, but it’s too expensive to generate LLM summaries for all 30K sections.
- Table chunks are especially difficult (second sketch below):
  - Embeddings perform poorly on structured/numeric data.
  - Summary-style embeddings (e.g. the first 300 tokens, or just the heading/caption) aren’t sufficient for value-level lookups.
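To make the cost problem concrete, the reranking I tried was shaped roughly like this (the `llm_score` helper stands in for my actual LLaMA call). The per-chunk summarization attempt had the same one-LLM-call-per-chunk structure:

```python
def llm_score(query: str, chunk: str) -> float:
    # Placeholder: one LLaMA generation that returns a relevance score.
    raise NotImplementedError


def rerank(query: str, chunks: list[str]) -> list[tuple[float, str]]:
    # One generation per chunk. Across 30K chunks that's 30K LLM calls
    # per query, which is where the latency and cost blow up.
    scored = [(llm_score(query, c), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```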
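And this is what I mean by “summary-style embeddings” for tables: I embed a truncated text rendering rather than the cell values, so a query about one specific number has almost nothing to match against. Simplified sketch (whitespace tokenization is just for illustration; in practice I use the model’s tokenizer):

```python
def table_embedding_text(heading: str, caption: str, table_md: str,
                         max_tokens: int = 300) -> str:
    # Truncate the rendered table to ~300 tokens before embedding.
    # On a wide or long table, most numeric cells never make it into
    # the embedded text, so value-level queries can't retrieve the chunk.
    tokens = f"{heading}\n{caption}\n{table_md}".split()
    return " ".join(tokens[:max_tokens])
```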
Looking for ideas or proven strategies to:
- Improve precision in initial retrieval at scale
- Handle table-heavy content more effectively
- Reduce cost while preserving accuracy
Any ideas, techniques, or tooling (besides LangChain) that worked for you?