I have greatly enjoyed this course; however, there is something I am unable to reproduce on my own: loading a saved ChromaDB.
I use Andrew’s lecture as the PDF in the example below.
I can do steps 1-3 just fine, but step 4 (loading the saved database back from disk) seems to fail.
Step 1: I load the PDF. len(docs) returns 22
Step 2: I split the documents in chunks. len(splits) returns 57
Step 3: Using OpenAI embeddings, I vectorize each chunk into a ChromaDB and write it to disk. I see the files have been written (see screenshot).
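For completeness, step 4 in my notebook looks roughly like the sketch below (variable names are just illustrative; this is the part that fails for me):

#%% Step 4: Load the saved ChromaDB back from disk
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
persist_directory = 'basic_langchain/chroma_storage'
embedding = OpenAIEmbeddings()
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)
print(vectordb._collection.count())  # this is where things go wrong for me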
A couple of things: have you tried running with any other PDF from which you can manually copy text? I am guessing the PDF you have tried has images embedded.
Could you please try with some other PDF file and let us know whether you are still facing this issue?
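A quick way to check whether a PDF actually contains extractable text (rather than scanned images) is to print how many characters PyPDF pulls out of it. This is just a diagnostic sketch; point the path at whichever PDF you are testing:

# Diagnostic: does PyPDF extract any text from this PDF?
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")  # or the PDF you are testing
pages = loader.load()
total_chars = sum(len(p.page_content) for p in pages)
print(f"{len(pages)} pages, {total_chars} characters extracted")
# If total_chars is 0 (or very small), the PDF is likely image-only and would need OCR.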
Something else I just noticed: when using the notebook from the website, there are two .parquet files that are not present in my chroma directory. I'm not sure if that has anything to do with it. I do have the .bin and .pkl files, though.
Sorry for the late response on this. You forgot to call vectordb.persist() at step 3.
Below is the complete code for your reference:
#%% Step 1: Load PDF
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)

#%% Step 2: Split into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_text(joined_page_text)

#%% Step 3: Embed, store in Chroma, and persist to disk
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

persist_directory = 'basic_langchain/chroma_storage'
embedding = OpenAIEmbeddings()
vectordb = Chroma.from_texts(
    texts=splits,
    embedding=embedding,
    persist_directory=persist_directory
)
vectordb.persist()  # this writes the index files to persist_directory

#%% Step 4: Load the persisted database back from disk
vectordb_loaded = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)
print(vectordb_loaded._collection.count())
I have tested the above code and it works fine. I have also checked the output folder basic_langchain/chroma_storage; the required parquet files are present after persisting.
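As an extra sanity check that the reloaded store is usable, you can run a quick similarity search against it (the query string below is just an example):

# Sanity check: query the reloaded store
query = "What are the prerequisites for this class?"
results = vectordb_loaded.similarity_search(query, k=3)
for doc in results:
    print(doc.page_content[:200])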