How to approach JSON-based RAG

First of all thank you very much for this course.

Do you have any recommendations for JSON-based RAG? Most of what I see is PDF- or text-based. I am trying to build a RAG application that has multiple JSON files, each containing information about a specific neighborhood. Each JSON contains information about the houses and about nearby points of interest like schools, parks, etc.

I see that we just insert everything into the vector database. It's not like a typical DB where you store a key and an object and look things up by key. How can I store this info in the vector DB to make sure my RAG application retrieves the correct data?

Also, how should I chunk the data, given that some points of interest might overlap with other neighborhoods?

Lastly, can I use RAG to generate a comprehensive description of a neighborhood?

Hi @rotexhawk,

Have you had any luck finding an answer yet? I've been dealing with a similar challenge, but using tabular data (Excel, CSV). I've just completed the course Advanced Retrieval for AI with Chroma, but as you said it's based on PDFs, and I couldn't replicate the same concept for my use case.

Thanks,
Alexandre

Hello Alex,

I had to customize my RAG application. Here is what I did:

  1. I added the separate JSONs to the vector DB and tagged each one with its zip code and neighborhood name.

  2. The first question in my RAG application is always, "Enter your neighborhood name or zip code".

  3. After that, when the user asks a question, I append the metadata to the question, i.e. "what's the population age?" gets converted to "what's the population age? ${zipCode}". This makes sure the retrieved info is relevant to that specific neighborhood.

  4. Convert the user question, with the appended zip code, to embeddings and let Chroma retrieve matches based on cosine similarity (or whichever distance metric it uses).

  5. Check the tags on the retrieved results to make sure they don't contain data from other neighborhoods.

  6. Pass the filtered results to the LLM to generate the response (a sketch of this flow follows below).
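
Here is a minimal sketch of that flow using Chroma directly; its where metadata filter does the step-5 check inside the query itself. The collection name, zip code, and sample document are placeholders, not my actual data:

import chromadb

client = chromadb.Client()
collection = client.create_collection("neighborhoods")  # placeholder name

# step 1: add each neighborhood JSON (serialized to text) tagged with its zip code
neighborhood_json_text = '{"houses": 1200, "schools": ["Pine Elementary"]}'  # placeholder
collection.add(
    ids=["hood-98101"],
    documents=[neighborhood_json_text],
    metadatas=[{"zip_code": "98101", "neighborhood": "Downtown"}],
)

# steps 4-5: embed the question and keep only results for the chosen zip code
results = collection.query(
    query_texts=["what's the population age? 98101"],
    n_results=1,  # only one document indexed in this sketch
    where={"zip_code": "98101"},
)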

Hope that helped!

Thanks for your response @rotexhawk and sorry for not replying sooner.

In my case the user can ask questions whose answers may be based on the content of one or more columns loaded from the CSV. I've tried different techniques using HF transformers/pipelines and LLMs, as well as LangChain with Chroma and csv_agent, but all of them gave either very poor results or errors I couldn't fix. Over the last weekend I decided to subscribe to OpenAI, and results have improved a lot compared with previous experiments.

Thanks again,
Alexandre

Hi @Alexandre_de_Vasconc, I am doing the same thing, but with MongoDB documents that have several key-value pairs. What I did was give each MongoDB document as a whole to the embedding model and store all the resulting vectors in a vector database. When a user asks a question about any of those documents, the vector database should pick up all the related documents and provide them to the LLM for a clean, precise answer. But the vector database is giving me the wrong documents. I have been trying this for many days. What I think is that the problem is in the embedding of the documents: the embedding model is not able to pick up the details of my documents, hence the wrong results. If there are any inputs you can give, that would be helpful.

Thanks

Hi @rotexhawk, I am running into the same issue I described to Alexandre above: my MongoDB documents have several key-value pairs (basically a JSON-like structure), I embed each document as a whole, and the vector database keeps returning wrong or unrelated documents. If there are any inputs you can give, that would be helpful.

Thanks

I am not sure what your data looks like. Maybe you can add the original content, or some sort of identifier, to the vector DB entries as metadata, so that when the vectors are retrieved you have some idea of what is being passed to the prompt.

Hi @johndaniel,

It seems that we're following a very similar approach. Below is a piece of my code, but as I said before, the results improved a lot when I changed to OpenAI. What LLM have you been using?

# imports (classic LangChain package layout)
from langchain.document_loaders import CSVLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# load csv: each row becomes one Document
loader = CSVLoader(file_path="test-split.csv")
docs = loader.load()

# embed the documents and index them in Chroma
embeddings = OpenAIEmbeddings()
vectordb = Chroma.from_documents(
    collection_name="my_chroma",
    documents=docs,
    embedding=embeddings,
)

# search vectordb for the 8 nearest chunks
result = vectordb.similarity_search(query, k=8)

Hi @Alexandre_de_Vasconc, thanks for replying. I am using OpenAI's GPT-4. I have used much the same code as above, but with MongodbLoader from LangChain instead of CSVLoader to get the docs back, and MongoDB vector search instead of Chroma for the embeddings and docs; the last step is the same. But I am still getting wrong or unrelated documents. Please let me know if you have any inputs. Thanks
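
Roughly, the loading side of what I did looks like this (a sketch with placeholder connection details; the import path assumes a LangChain version that ships MongodbLoader in langchain_community, and the loader needs the motor driver installed):

from langchain_community.document_loaders.mongodb import MongodbLoader

# placeholder connection details
loader = MongodbLoader(
    connection_string="mongodb://localhost:27017/",
    db_name="monuments_db",        # placeholder database name
    collection_name="monuments",   # placeholder collection name
)

# each MongoDB document becomes one LangChain Document for embedding
docs = loader.load()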

Hi @rotexhawk, thank you for replying. Here is a sample that is a replica of my data:
{
  "monument_id": 12345,
  "monument_name": "The Grand Monument",
  "location": {
    "city": "Cityville",
    "country": "Landonia"
  },
  "year_built": 1850,
  "architect": "John Architectson",
  "height_meters": 50.5,
  "materials": ["marble", "granite"],
  "historical_significance": true,
  "visitor_count": 500000,
  "open_to_public": true,
  "maintenance_cost": 10000.0,
  "architectural_style": "Neoclassical",
  "inscription": "Dedicated to the spirit of progress",
  "monument_type": "Landmark",
  "monument_condition": "Well-maintained",
  "entry_fee": 5.0,
  "monument_category": "Historical",
  "guided_tours_available": true,
  "featured_in_guidebooks": ["Landonia Explorer", "Monuments Today"],
  "visitor_reviews": [
    {
      "user_id": "user123",
      "rating": 4.5,
      "comment": "Amazing architecture and rich history!"
    },
    {
      "user_id": "user456",
      "rating": 5.0,
      "comment": "A must-visit landmark in Cityville."
    }
  ]
}

The above is a single monument document; I have thousands of monument documents like it, each with several key-value pairs. So let's say I ask a question like "what is the monument located in CityVille, Landonia"; basically I am asking for the above document, which is then sent to the LLM for a clean answer. But when I query, I get all the wrong documents. What I am thinking is that the way I am embedding the documents is going wrong, and that's why I am not getting the correct documents. I hope the above explanation makes sense; please let me know if you have any inputs. Thanks

Hi @johndaniel, happy to share learnings and bumps on the road :slight_smile:

My results are not 100% accurate, and I'm not sure they ever will be, considering the use of similarity and MMR searches. I've been doing loads of experimentation, and as I write this, the best results on a test split were achieved with the following options:

  • SOLUTION-1: prompt with csv_content as context (no retriever filter) passed on to the LLM
  • SOLUTION-2: prompt using SystemMessagePromptTemplate, HumanMessagePromptTemplate, and ChatPromptTemplate (a sketch follows below)
  • SOLUTION-3: vectorstore used to filter relevant answers only, which are then passed on to the LLM
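
For SOLUTION-2, the prompt setup looks roughly like this (a sketch; the template wording and the csv_content/query values are placeholders):

from langchain.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)

csv_content = "name,age\nAlice,30"   # placeholder: rows loaded from the CSV
query = "What is Alice's age?"       # placeholder: the user question

# the system message carries the CSV rows as context, the human message the question
prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(
        "Answer using only the following CSV content as context:\n{csv_content}"
    ),
    HumanMessagePromptTemplate.from_template("{question}"),
])

messages = prompt.format_messages(csv_content=csv_content, question=query)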

I'll continue exploring how I can improve both the system prompt and the chunking/embedding steps, as they seem to make the difference.

Hope the above helps.

Thanks,
Alexandre

@johndaniel, you could try applying an LLM to extract additional information from the query. It can help you get valuable, structured information out of user queries. Check out instructor and pydantic, for example:

from openai import OpenAI
from pydantic import BaseModel
import instructor

# patch the OpenAI client so completions can be parsed into Pydantic models
client = instructor.patch(OpenAI())

class Query(BaseModel):
    rewritten_query: str
    city: str
    country: str

def expand_query(q) -> Query:
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_model=Query,
        messages=[
            {
                "role": "system",
                "content": "You're a query understanding system for the Metafor Systems search engine. Here are some tips: ...",
            },
            {"role": "user", "content": f"query: {q}"},
        ],
    )

query = expand_query("what is the monument located in CityVille , Landonia")
query

You will get something like this as output:

Query(rewritten_query='monument in CityVille, Landonia', city='CityVille', country='Landonia')

Here you can find additional information

Hi @j.k, apologies for jumping into the middle of your reply to @johndaniel, but this area, "question processing", is what I'm going to experiment with next as well. Have you had any experience using LangChain to do that?

@Alexandre_de_Vasconc, unfortunately not. But it should be possible to use just Pydantic and LangChain for this.
Please leave a reply here when you find an interesting solution for LangChain :wink: Thanks.
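
Untested on my side, but a minimal sketch of how that might look with LangChain's PydanticOutputParser (assuming the classic langchain package layout) would be:

from langchain.chat_models import ChatOpenAI
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from pydantic import BaseModel

class Query(BaseModel):
    rewritten_query: str
    city: str
    country: str

# the parser turns the model's reply back into a Query instance
parser = PydanticOutputParser(pydantic_object=Query)

prompt = PromptTemplate(
    template="Extract a structured query.\n{format_instructions}\nquery: {q}\n",
    input_variables=["q"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

llm = ChatOpenAI(model_name="gpt-3.5-turbo")
output = llm.predict(prompt.format(q="what is the monument located in CityVille , Landonia"))
query = parser.parse(output)  # -> Query(rewritten_query=..., city=..., country=...)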

Hello guys, I hope y'all are doing well.
I think I'm facing a similar issue here, and I want to get your thoughts on it, since you have experimented with this type of challenge.
In my case I have data stored in various data sources, and I have an LLM that does NL-to-SQL and grabs the data, but I have to specify one and only one database.
The challenge for me is to query multiple (usually 2) databases that could be different and get the needed result. For example: "I want to know the total number of students in LA", where I specify database1 and database2; database1 has 300 students and database2 has 200, so the response should be "500 students".
Another case: let's say I want all student names. The LLM should get all the names in both DBs, even though they could have different column names, e.g. Student_name and Namestudent. A sketch of the combining step is below.
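
To illustrate the combining step I have in mind (everything here is a placeholder: the connection URLs, the students table, and the per-database column names):

from sqlalchemy import create_engine, text

# two placeholder databases whose student-name columns differ
databases = [
    {"url": "sqlite:///db1.sqlite", "name_col": "Student_name"},
    {"url": "sqlite:///db2.sqlite", "name_col": "Namestudent"},
]

total = 0
names = []
for db in databases:
    engine = create_engine(db["url"])
    with engine.connect() as conn:
        # the same logical question is run against each database, then combined
        total += conn.execute(text("SELECT COUNT(*) FROM students")).scalar()
        rows = conn.execute(text(f'SELECT {db["name_col"]} FROM students'))
        names.extend(row[0] for row in rows)

print(total)  # e.g. 300 + 200 = 500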

I was thinking about storing all the data in one text or JSON file and using RAG, but I don't have a clear idea how to do so.
Thank you guys in advance.

@rotexhawk, @johndaniel
Hey mates, if you could please help me out: I want to create a RAG-based application, but all my data is in JSON format. I tried converting it to text, but that results in low accuracy. Is there any way we can embed the JSON directly rather than as text?