Sorry for the late response on this. You forgot to call vectordb.persist()
at step 3.
Below is the complete code for your reference:
# %% Step 1: Load PDF
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)
# %% Step 2: Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_text(joined_page_text)
# %% Step 3: Embed and vectorize and store
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
persist_directory = 'basic_langchain/chroma_storage'
embedding = OpenAIEmbeddings()
vectordb = Chroma.from_texts(
    texts=splits,
    embedding=embedding,
    persist_directory=persist_directory
)
vectordb.persist()
# Reload from the same directory to confirm the data was written to disk
vectordb_loaded = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)
print(vectordb_loaded._collection.count())
I have tested the code above and it works fine. I also checked the output folder basic_langchain/chroma_storage, and the required parquet files are present after persisting.
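In case it helps to see why the missing call matters, here is a toy sketch of the pattern. It is not Chroma's actual implementation: ToyStore, store.json and reload_count are made-up names for illustration, and the only assumption is that the store keeps new data in memory until persist() is called, which is how the behaviour above looks from the outside. It uses only the standard library, so it runs without an API key.

```python
import json
import tempfile
from pathlib import Path


class ToyStore:
    """Hypothetical stand-in for a vector store; NOT Chroma's real code."""

    def __init__(self, persist_directory):
        self.path = Path(persist_directory) / "store.json"
        # Load whatever an earlier instance wrote to disk, if anything.
        self.texts = json.loads(self.path.read_text()) if self.path.exists() else []

    def add_texts(self, texts):
        # Only the in-memory copy changes here.
        self.texts.extend(texts)

    def persist(self):
        # Flush the in-memory copy to disk so a later load can see it.
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self.texts))

    def count(self):
        return len(self.texts)


def reload_count(call_persist):
    """Add two texts, optionally persist, then reload from the same directory."""
    with tempfile.TemporaryDirectory() as tmp:
        store = ToyStore(tmp)
        store.add_texts(["chunk one", "chunk two"])
        if call_persist:
            store.persist()
        # A fresh instance only sees what is on disk.
        return ToyStore(tmp).count()


print(reload_count(call_persist=False))  # 0
print(reload_count(call_persist=True))   # 2
```

Without persist() the reloaded store is empty, which matches the symptom of a count of 0 after loading from persist_directory.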