I have greatly enjoyed this course; however, there is something I am unable to reproduce on my own: loading a saved ChromaDB.
I use Andrew’s lecture as the PDF in the example below.
I can do steps 1-3 just fine, but step 4 (loading the saved database back from disk) seems to fail.
Step 1: I load the PDF. len(docs) returns 22
Step 2: I split the documents in chunks. len(splits) returns 57
Step 3: Using OpenAI embeddings, I vectorize each chunk into a ChromaDB and write it to disk. I see the files have been written (see screenshot).
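For completeness, step 4 in my notebook looks roughly like the sketch below (variable names are just illustrative; this is the part that fails for me):

#%% Step 4: Load the saved ChromaDB back from disk
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
persist_directory = 'basic_langchain/chroma_storage'
embedding = OpenAIEmbeddings()
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)
print(vectordb._collection.count())  # this is where things go wrong for me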
A couple of things: have you tried running with any other PDF from which you can manually copy text? I am guessing the PDF you have tried has images embedded.
Could you please try with some other PDF file and let us know whether you are still facing this issue?
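A quick way to check whether a PDF actually contains extractable text (rather than scanned images) is to print how many characters PyPDF pulls out of it. This is just a diagnostic sketch; point the path at whichever PDF you are testing:

# Diagnostic: does PyPDF extract any text from this PDF?
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")  # or the PDF you are testing
pages = loader.load()
total_chars = sum(len(p.page_content) for p in pages)
print(f"{len(pages)} pages, {total_chars} characters extracted")
# If total_chars is 0 (or very small), the PDF is likely image-only and would need OCR.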
Something else I just noticed: when using the notebook from the website, there are two .parquet files that are not present in my chroma directory. I'm not sure if that has anything to do with it. I do have the .bin and .pkl files, though.
Sorry for the late response on this. You forgot to call vectordb.persist() at step 3.
Below is the complete code for your reference:
#%% Step 1: Load PDF
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)

#%% Step 2: Split into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_text(joined_page_text)

#%% Step 3: Embed, store in Chroma, and persist to disk
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

persist_directory = 'basic_langchain/chroma_storage'
embedding = OpenAIEmbeddings()
vectordb = Chroma.from_texts(
    texts=splits,
    embedding=embedding,
    persist_directory=persist_directory
)
vectordb.persist()  # this writes the index files to persist_directory

#%% Step 4: Load the persisted database back from disk
vectordb_loaded = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)
print(vectordb_loaded._collection.count())
I have tested the above code and it works fine. I have also checked the output folder basic_langchain/chroma_storage; the required parquet files are present after persisting.
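As an extra sanity check that the reloaded store is usable, you can run a quick similarity search against it (the query string below is just an example):

# Sanity check: query the reloaded store
query = "What are the prerequisites for this class?"
results = vectordb_loaded.similarity_search(query, k=3)
for doc in results:
    print(doc.page_content[:200])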