Why do we need the recursive splitter to break the text into chunks before using the token splitter? Can't we just use the token splitter directly to split the whole text into context-window-sized chunks? What's the benefit of the two-step method?
I have the same question!
I've been thinking about this. Here's my take: the presenter noted that SentenceTransformersTokenTextSplitter() has a maximum input context window of 256 tokens. He used RecursiveCharacterTextSplitter() to break the text into 1000-character chunks, which average out to roughly 250 tokens (at about 4 characters per token), so most chunks already fit the token splitter's window, and the recursive pass gets to cut on natural boundaries (paragraphs, sentences) rather than arbitrary token positions. That's my understanding until someone corrects us.
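To make the two-stage idea concrete, here's a dependency-free sketch (this is not the LangChain API, and the ~4 characters-per-token figure is just a crude heuristic standing in for a real tokenizer): stage 1 splits on natural boundaries into roughly 1000-character chunks, and stage 2 hard-enforces the 256-token budget on any chunk that still exceeds it.

```python
MAX_CHARS = 1000          # stage-1 target chunk size (characters)
MAX_TOKENS = 256          # embedding model's context window (tokens)
CHARS_PER_TOKEN = 4       # crude average; real splitters count actual tokens

def recursive_split(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Stage 1: split on paragraph, then sentence, then word boundaries."""
    if len(text) <= max_chars:
        return [text]
    for sep in ("\n\n", ". ", " "):
        if sep in text:
            parts, current = [], ""
            for piece in text.split(sep):
                candidate = current + sep + piece if current else piece
                if len(candidate) <= max_chars:
                    current = candidate
                else:
                    if current:
                        parts.append(current)
                    current = piece
            if current:
                parts.append(current)
            # recurse in case a single piece is still too long
            out = []
            for part in parts:
                out.extend(recursive_split(part, max_chars))
            return out
    # no separator found: hard cut
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def token_split(chunk: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Stage 2: enforce the token budget on each stage-1 chunk."""
    limit = max_tokens * CHARS_PER_TOKEN
    return [chunk[i:i + limit] for i in range(0, len(chunk), limit)]

def split_document(text: str) -> list[str]:
    return [sub for chunk in recursive_split(text) for sub in token_split(chunk)]
```

The point the two stages illustrate: if you ran only stage 2, every cut would land at an arbitrary character/token position; running stage 1 first means the vast majority of chunks are already under the token limit and were cut at sensible boundaries, with stage 2 acting only as a safety net.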
But since he notes that character-based splitting loses sentence meaning, wouldn't it be better to split the text first with NLTKTextSplitter(), preserving sentence boundaries, and then token-split?