Document Splitting

jyotirmoy_devops · July 11, 2023, 2:56pm

While creating document embeddings for a PDF, I tried to split it into chunks, but I am a little confused about the difference between the following two approaches:

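from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter

# Load the PDF (one Document per page)
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()

# Split on newlines into chunks of up to 1000 characters with 150 characters of overlap
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len,
)
docs = text_splitter.split_documents(pages)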

AND

loader = PyPDFLoader(fileName)
pages = loader.load_and_split()

When implementing QA, I see a big difference in the results between the two.

Any help would be appreciated.

Erlebach · October 5, 2023, 2:22am

Nobody answered you? I am using DirectoryLoader and noticed that it is doing chunking by default. I can’t figure out how to change the defaults. Any insights?
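
One likely source of the difference: as far as I know, in the LangChain releases from around this time, load_and_split() called with no arguments falls back to a RecursiveCharacterTextSplitter with its default settings (chunk_size=4000 and chunk_overlap=200 in those versions), and it accepts a text_splitter argument to override that. So the two snippets would be producing chunks of very different sizes. The sketch below is not LangChain's code; it is a simplified, pure-Python illustration of how a single-separator splitter merges pieces under chunk_size and chunk_overlap, and the function name split_on_separator is made up for this example.

```python
def split_on_separator(text, separator="\n", chunk_size=1000, chunk_overlap=150):
    """Simplified illustration (not LangChain's implementation).

    Splits text on one separator, then greedily merges the pieces into
    chunks of at most chunk_size characters. The tail of each chunk (up to
    chunk_overlap characters) is carried into the next one. A single piece
    longer than chunk_size is emitted as-is, because a one-separator
    splitter has no finer boundary to fall back on.
    """
    def joined_len(parts):
        return len(separator.join(parts))

    pieces = [p for p in text.split(separator) if p]
    chunks = []
    current = []  # pieces that make up the chunk being built

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Keep only as many trailing pieces as fit in the overlap budget
            # while still leaving room for the incoming piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


# 200 synthetic lines of 50 characters each, standing in for PDF text.
text = "\n".join(f"line {i:04d} " + "x" * 40 for i in range(200))

small = split_on_separator(text, chunk_size=1000, chunk_overlap=150)
large = split_on_separator(text, chunk_size=4000, chunk_overlap=200)

print(len(small), max(len(c) for c in small))
print(len(large), max(len(c) for c in large))
```

With the smaller chunk_size the same text yields many more, much shorter chunks, which changes what a retriever hands to the QA step. If the goal is for both code paths to behave the same, passing the same configured splitter to load_and_split(text_splitter=...) should work, assuming your installed version supports that argument.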
