Document splitting: Chunksize

PhMueller · July 6, 2023, 7:05am

Hey,

first of all, thanks for the amazing course!
It is very informative and super nice to watch.

I have a question regarding some best practices for splitting the documents.
I’d like to use a sentence transformer (BERT-based right now) to find the most similar chunks in a vector store.

Do you recommend a way of splitting text data in this case?
I’m currently splitting at the sentence level, but that doesn’t seem to work very well,
especially when the user query is short, the sentence is long, and the transformer applies mean pooling at the end.
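
To make the setup concrete, here is a minimal sketch of sentence-level splitting followed by mean pooling. The regex splitter, the helper names, and the 2-dimensional toy token vectors are assumptions for illustration only; they stand in for the real tokenizer and BERT-based model, which aren't shown here. It only illustrates how averaging over many tokens can dilute the one token that matches a short query.

```python
import math
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def mean_pool(token_vectors):
    """Average token vectors component-wise into one embedding."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy token vectors (hypothetical): [1, 0] = "on topic", [0, 1] = "off topic".
on_topic, off_topic = [1.0, 0.0], [0.0, 1.0]

query = mean_pool([on_topic])                       # short query, one token
short_chunk = mean_pool([on_topic, off_topic])      # short sentence
long_chunk = mean_pool([on_topic] + [off_topic] * 3)  # long sentence

# The long sentence contains the same matching token, but scores lower
# because mean pooling averages it together with more unrelated tokens.
print(cosine(query, short_chunk), cosine(query, long_chunk))
```

In this toy setup the longer sentence scores lower than the shorter one even though both contain the matching token, which is the effect described above.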

Do you have any tips on how best to split the data?

Cheers and thanks a lot,

Philipp
