Unable to load a saved Chroma Database

Hello all,

I have greatly enjoyed this course; however, there is something I am unable to reproduce on my own, namely loading a saved ChromaDB.
I use Andrew’s lecture as the PDF in the example below.

I can do steps 1-3 just fine, but step 4 seems to fail.

  1. Step 1: I load the PDF. len(docs) returns 22.
  2. Step 2: I split the documents into chunks. len(splits) returns 57.
  3. Step 3: Using OpenAI embeddings, I vectorize each chunk into a ChromaDB and write it to disk. I can see the files have been written (see screenshot), and vectordb._collection.count() returns 57.
  4. Step 4: I try to load it, but it doesn’t work: there is no error, yet print(vectordb_loaded._collection.count()) returns 0.

What am I doing wrong?

#%% Step 1: Document Loading
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("basic_langchain/machinelearning-lecture01.pdf")
docs = loader.load()
# len(docs) returns 22

# %% Step 2: Split document in chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)
splits = text_splitter.split_documents(docs)
# len(splits) returns 57

# %% Step 3: Embed and vectorize and store
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma


persist_directory = 'basic_langchain/chroma_storage'

embedding = OpenAIEmbeddings()

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)
# print(vectordb._collection.count()) returns 57

# %% Step 4: Load the saved chroma db
embedding = OpenAIEmbeddings()
vectordb_loaded = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)
#print(vectordb_loaded._collection.count()) returns 0

Welcome to the community.

A couple of things. Have you tried running this with any other PDF from which you can manually copy text? I am guessing the PDF you tried contains embedded images (scanned pages with no extractable text).

Could you please try with some other PDF file and let us know whether you still face this issue?
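To rule out a scanned/image-only PDF, a minimal sketch like the following could help (assuming `pages` is the list returned by `PyPDFLoader.load()`, as in the original post):

```python
# Sketch: detect whether a loaded PDF produced any extractable text.
# Scanned/image-only PDFs typically come back with empty page_content.
def has_extractable_text(page_texts):
    """Return True if at least one page contains non-whitespace text."""
    return any(text.strip() for text in page_texts)

# Usage (assuming `pages` came from PyPDFLoader.load()):
# has_extractable_text(p.page_content for p in pages)
```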

Thanks for the reply, @raj. I also tried with a .txt file using:

from langchain.document_loaders import TextLoader

It still doesn’t seem to work. Is there anything wrong with the code I first posted?

Thanks for the update. I will try to reproduce this issue locally and will keep you posted.

Thanks @raj !

Something I just noticed as well is that using the notebook from the website, there seem to be two .parquet files that are not present in my chroma directory. Not sure if that has anything to do with it. I do have the .bin and .pkl files though.

I think I have found the root cause. In step 2, you are passing document objects to the splitter instead of the plain strings that text_splitter.split_text() expects.

Please try the following code and let me know if it works:

# Load PDF
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)

# Split into plain-text chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_text(joined_page_text)

Thanks @raj

I’ve updated the code to match what you suggested.
I’m able to:
1/ load the PDF successfully

2/ split the PDF

3/ create a ChromaDB (replaced vectordb = Chroma.from_documents with Chroma.from_texts)

4/ however, I am still unable to load the ChromaDB from disk. The code runs, but print(vectordb_loaded._collection.count()) returns 0.

Are you able to load the ChromaDB from disk and have it be non-empty?

#%% Step 1: Load PDF
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("basic_langchain/machinelearning-lecture01.pdf")
pages = loader.load()
all_page_text=[p.page_content for p in pages]
joined_page_text=" ".join(all_page_text)

# Step 2: Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1500,chunk_overlap = 150)
splits = text_splitter.split_text(joined_page_text)



# %% Step 3: Embed and vectorize and store
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma


persist_directory = 'basic_langchain/chroma_storage'

embedding = OpenAIEmbeddings()


vectordb = Chroma.from_texts(
    texts=splits,
    embedding=embedding,
    persist_directory=persist_directory
)
# print(vectordb._collection.count()) returns 45

# %% Step 4: Load the saved chroma db


vectordb_loaded = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)
#print(vectordb_loaded._collection.count()) returns 0

Sorry for the late response on this. You forgot to call vectordb.persist() in step 3.

Below is the complete code for your reference:

#%% Step 1: Load PDF
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text=[p.page_content for p in pages]
joined_page_text=" ".join(all_page_text)

# Step 2: Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1500,chunk_overlap = 150)
splits = text_splitter.split_text(joined_page_text)

# %% Step 3: Embed and vectorize and store
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

persist_directory = 'basic_langchain/chroma_storage'
embedding = OpenAIEmbeddings()

vectordb = Chroma.from_texts(
    texts=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

vectordb.persist()

vectordb_loaded = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)
print(vectordb_loaded._collection.count())

I have tested the above code and it works fine. I also checked the output folder basic_langchain/chroma_storage; the required parquet files are present after persisting.


I would suggest checking the folder contents after executing the persist function. You will notice the difference.
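For anyone following along, a quick way to compare the folder contents before and after calling persist() is a small helper like this (the directory path is just the one used earlier in the thread):

```python
import os

def list_persisted_files(persist_directory):
    """Return paths of all files under the Chroma storage folder, relative to it."""
    files = []
    for root, _dirs, names in os.walk(persist_directory):
        for name in names:
            files.append(os.path.relpath(os.path.join(root, name), persist_directory))
    return sorted(files)

# e.g. list_persisted_files('basic_langchain/chroma_storage')
# Per the thread: the .bin/.pkl index files exist before persist(),
# and the .parquet files appear only after it.
```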

It works!! Thank you @raj !!


Awesome, please feel free to reach out to us with any additional queries.


@raj how can I avoid splitting/chunking if a doc has already been added to the vectordb?