Optimizing a RAG-based AI Assistant for High-Volume Transaction Data

I’m building an AI assistant using Retrieval-Augmented Generation (RAG) over my own large-scale transaction data. Each transaction is stored as an individual chunk, and because of the volume (thousands of transactions per day), answering a question would require retrieving a huge number of chunks: a top_k high enough to capture meaningful context (e.g., 1000+) blows past the generation model’s token limit, so I currently use a much lower value like 5 and lose important context. A rough sketch of my current pipeline is below the list. I’m looking for best practices to:

  • Efficiently chunk or group transaction data for better semantic relevance (see the grouping example at the end of the post)
  • Reduce the number of retrieved chunks without losing important context
  • Handle token limitations in vector-based retrieval
  • Improve overall performance and accuracy of my RAG pipeline on large transactional datasets
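
For reference, here is a minimal sketch of what I do today. The field names, sample data, embedding model, and vector store (Chroma + sentence-transformers) are placeholders rather than my real stack, but the shape of the pipeline is the same:

```python
import chromadb
from sentence_transformers import SentenceTransformer

# Placeholder data: in reality there are thousands of transactions per day.
transactions = [
    {"id": "tx-001", "date": "2024-05-01", "merchant": "Acme Corp", "amount": 120.50},
    {"id": "tx-002", "date": "2024-05-01", "merchant": "Globex", "amount": 9.99},
]

model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()
collection = client.create_collection("transactions")

# One chunk per transaction -- this is the part that doesn't scale.
for tx in transactions:
    text = f"{tx['date']} {tx['merchant']} {tx['amount']:.2f}"
    collection.add(
        ids=[tx["id"]],
        embeddings=[model.encode(text).tolist()],
        documents=[text],
    )

question = "How much did I spend at Acme Corp in May?"
# top_k = 5 retrieves far too few chunks to answer aggregate questions,
# but anything near 1000 exceeds the generation model's token limit.
results = collection.query(
    query_embeddings=[model.encode(question).tolist()],
    n_results=5,
)
print(results["documents"])
```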

Any advice or architecture suggestions are welcome.
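
To make the first bullet concrete, this is the kind of grouping I have in mind: rolling transactions up by day and embedding one summary chunk per day instead of one chunk per transaction. It continues from the sketch above (same placeholder names), and I’m not sure it’s the right direction:

```python
from collections import defaultdict

# Group raw transactions by day and embed one summary chunk per day,
# so a month of data becomes ~30 chunks instead of tens of thousands.
by_day = defaultdict(list)
for tx in transactions:
    by_day[tx["date"]].append(tx)

for day, txs in by_day.items():
    total = sum(t["amount"] for t in txs)
    merchants = ", ".join(sorted({t["merchant"] for t in txs}))
    summary = f"{day}: {len(txs)} transactions totalling {total:.2f} at {merchants}"
    collection.add(
        ids=[f"day-{day}"],
        embeddings=[model.encode(summary).tolist()],
        documents=[summary],
    )
```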