How to perform Question Answering on Multiple JSON Files: Beyond SQL Query Generation with LLMs

Hello OpenAI Community,

I am currently exploring various approaches to implement a question-answering system on multiple JSON files. Here’s a quick summary of the methods I have already implemented:

  1. SQL Query Generation with LLMs:
  • I built a system where the user’s natural language query is converted into an SQL query by an LLM.
  • The generated SQL query is then executed against a relational database to fetch the relevant records (see the first sketch after this list).
  2. MongoDB Query Generation with LLMs:
  • Similarly, I developed a system where the user’s natural language query is translated into a MongoDB query by an LLM.
  • The generated MongoDB query is then executed on a MongoDB database to retrieve the relevant documents (second sketch below).
  3. LangChain JSON Toolkit:
  • Using LangChain’s JSON Toolkit agent, I implemented a solution that answers questions over JSON files by letting the agent explore the loaded JSON via an LLM (third sketch below).
  4. Combined JSON Data with Python Code Generation:
  • I combined the data from all JSON files and used an LLM to generate Python code on the fly that queries the combined data programmatically (fourth sketch below).
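
For concreteness, approach 1 looks roughly like the following. This is a minimal sketch rather than my exact implementation, assuming the openai Python SDK and SQLite; the model name, the hypothetical orders schema, and the data.db path are all placeholders:

```python
import sqlite3

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical schema; replace with your own tables.
SCHEMA = "CREATE TABLE orders (id INTEGER, customer TEXT, total REAL, created_at TEXT);"

def question_to_sql(question: str) -> str:
    # Ask the model for exactly one SQLite SELECT statement, nothing else.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system",
             "content": "Translate the user question into a single SQLite SELECT "
                        f"statement for this schema and return only the SQL:\n{SCHEMA}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content.strip().strip("`")

def answer(question: str, db_path: str = "data.db") -> list:
    sql = question_to_sql(question)
    # Validate or allow-list the generated SQL before running it in production.
    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql).fetchall()
```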
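
Approach 2 follows the same pattern with a Mongo filter document instead of SQL. A minimal sketch, assuming pymongo, a local MongoDB instance, and a hypothetical shop.orders collection; JSON mode keeps the model's reply parseable:

```python
import json

from openai import OpenAI
from pymongo import MongoClient

llm = OpenAI()  # assumes OPENAI_API_KEY is set
collection = MongoClient("mongodb://localhost:27017")["shop"]["orders"]  # hypothetical

def question_to_filter(question: str) -> dict:
    # JSON mode guarantees the reply parses as a Mongo filter document.
    resp = llm.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Return only a MongoDB find() filter as a JSON object for a "
                        "collection with fields: customer, total, created_at."},
            {"role": "user", "content": question},
        ],
    )
    return json.loads(resp.choices[0].message.content)

def answer(question: str) -> list:
    # Run the generated filter; restrict to read-only operations in production.
    return list(collection.find(question_to_filter(question)))
```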
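
For approach 3, LangChain's JSON toolkit wraps the loaded document in a JsonSpec and hands it to a JSON agent. A minimal sketch, assuming the langchain-community and langchain-openai packages and a placeholder data.json file (the toolkit expects a dict at the top level):

```python
import json

from langchain_community.agent_toolkits import JsonToolkit, create_json_agent
from langchain_community.tools.json.tool import JsonSpec
from langchain_openai import ChatOpenAI

with open("data.json") as f:  # placeholder file; must contain a JSON object
    data = json.load(f)

spec = JsonSpec(dict_=data, max_value_length=4000)
agent = create_json_agent(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),  # placeholder model
    toolkit=JsonToolkit(spec=spec),
    verbose=True,
)

# The agent inspects keys and values step by step to answer the question.
print(agent.invoke({"input": "How many records mention customer Alice?"}))
```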
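
And approach 4 can be sketched as below, assuming the combined data fits in memory; the data/*.json glob, the preview length, and the result-variable convention are illustrative, and model-generated code should only ever run in a sandbox:

```python
import glob
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def load_all(pattern: str = "data/*.json") -> dict:
    # Merge every JSON file into one dict keyed by file path.
    return {path: json.load(open(path)) for path in glob.glob(pattern)}

def answer(question: str) -> object:
    data = load_all()
    preview = json.dumps(data, default=str)[:2000]  # structure hint for the model
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system",
             "content": "Write only Python code that reads a dict named `data` "
                        "(structure previewed below) and stores the answer in a "
                        f"variable named `result`:\n{preview}"},
            {"role": "user", "content": question},
        ],
    )
    code = resp.choices[0].message.content.strip()
    code = code.removeprefix("```python").removesuffix("```").strip()
    scope = {"data": data}
    exec(code, scope)  # never exec untrusted output outside a sandbox
    return scope["result"]
```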

While these methods have been effective, I am now seeking alternative approaches to enhance the robustness and versatility of my system. Specifically, I would like to explore methods that:

  • Do not rely solely on query generation.
  • Are suitable for handling unstructured or semi-structured JSON data.
  • Optimize for scalability and efficiency when dealing with large JSON datasets.

I would love to hear your suggestions, insights, and any other possible approaches or tools that could help me improve my system!

Looking forward to your ideas and recommendations.

I am struggling with processing large JSON files with an LLM too, for diverse kinds of operations such as making bulk changes and creating objectives, features, strategic themes, etc.
I have applied some of the techniques you have already implemented. Now I am using chunking strategies, where I face the challenge of losing context; in some of my cases, data loss is not acceptable.
In your case, I would suggest meaningful chunking of your data followed by embedding-based retrieval (vector search); that might help you build a more versatile solution. A sketch of this idea follows.
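
To make that concrete, here is a minimal sketch of chunking plus embedding-based retrieval, assuming the openai SDK and a small in-memory numpy index; one chunk per top-level record is just one possible chunking choice, and text-embedding-3-small and the records.json file are placeholders:

```python
import json

import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def chunk_json(path: str) -> list[str]:
    # One chunk per top-level record keeps each object's fields together,
    # which limits the context loss that fixed-size splitting can cause.
    with open(path) as f:
        records = json.load(f)
    return [json.dumps(r) for r in records]

def retrieve(question: str, chunks: list[str], vectors: np.ndarray, k: int = 5) -> list[str]:
    # Rank chunks by cosine similarity to the question embedding.
    q = embed([question])[0]
    scores = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

# Usage: build the index once, then retrieve per question and pass the
# top chunks to the chat model as context for the final answer.
chunks = chunk_json("records.json")  # placeholder file with a list of objects
index = embed(chunks)
print(retrieve("Which orders exceeded 1000?", chunks, index))
```
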
I would like to see your current implementations if possible. Maybe I could suggest improvements and learn something from you.