Optimizing a RAG-based AI Assistant for High-Volume Transaction Data

I’m building an AI assistant using Retrieval-Augmented Generation (RAG) over my own large-scale transaction data. Each transaction is stored as an individual chunk, and because of the volume (thousands of transactions per day), answering a question would require retrieving a huge number of chunks: a top_k high enough to capture meaningful context (e.g., 1000+) blows past the generation model’s token limit, so I currently use a much lower value like 5 and lose important context. A rough sketch of my current pipeline is below the list. I’m looking for best practices to:

  • Efficiently chunk or group transaction data for better semantic relevance (see the grouping example at the end of the post)
  • Reduce the number of retrieved chunks without losing important context
  • Handle token limitations in vector-based retrieval
  • Improve overall performance and accuracy of my RAG pipeline on large transactional datasets
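
For reference, here is a minimal sketch of what I do today. The field names, sample data, embedding model, and vector store (Chroma + sentence-transformers) are placeholders rather than my real stack, but the shape of the pipeline is the same:

```python
import chromadb
from sentence_transformers import SentenceTransformer

# Placeholder data: in reality there are thousands of transactions per day.
transactions = [
    {"id": "tx-001", "date": "2024-05-01", "merchant": "Acme Corp", "amount": 120.50},
    {"id": "tx-002", "date": "2024-05-01", "merchant": "Globex", "amount": 9.99},
]

model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()
collection = client.create_collection("transactions")

# One chunk per transaction -- this is the part that doesn't scale.
for tx in transactions:
    text = f"{tx['date']} {tx['merchant']} {tx['amount']:.2f}"
    collection.add(
        ids=[tx["id"]],
        embeddings=[model.encode(text).tolist()],
        documents=[text],
    )

question = "How much did I spend at Acme Corp in May?"
# top_k = 5 retrieves far too few chunks to answer aggregate questions,
# but anything near 1000 exceeds the generation model's token limit.
results = collection.query(
    query_embeddings=[model.encode(question).tolist()],
    n_results=5,
)
print(results["documents"])
```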

Any advice or architecture suggestions are welcome.
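
To make the first bullet concrete, this is the kind of grouping I have in mind: rolling transactions up by day and embedding one summary chunk per day instead of one chunk per transaction. It continues from the sketch above (same placeholder names), and I’m not sure it’s the right direction:

```python
from collections import defaultdict

# Group raw transactions by day and embed one summary chunk per day,
# so a month of data becomes ~30 chunks instead of tens of thousands.
by_day = defaultdict(list)
for tx in transactions:
    by_day[tx["date"]].append(tx)

for day, txs in by_day.items():
    total = sum(t["amount"] for t in txs)
    merchants = ", ".join(sorted({t["merchant"] for t in txs}))
    summary = f"{day}: {len(txs)} transactions totalling {total:.2f} at {merchants}"
    collection.add(
        ids=[f"day-{day}"],
        embeddings=[model.encode(summary).tolist()],
        documents=[summary],
    )
```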