Sentence text splitters and chunk/overlap sizes?

happyday · July 12, 2023, 4:27pm

I am fairly new to NLP, so forgive me if this is an obvious question. It is not obvious to me how sentence splitters like NLTKTextSplitter and SpacyTextSplitter determine how many sentences to put in a chunk, i.e., what do chunk size and overlap size mean for these two splitters? I see that the text splitters inherit from the TextSplitter class, where the defaults are set to 4000 (chunk size) and 200 (overlap size). But it is not clear to me whether the sentence splitters use these. The chunks I get back are all around 4000 (always less), but while it is completely obvious why the chunk size is the way it is for CharacterTextSplitter, how does this work for the sentence text splitters? Thank you.

isaac.casm · July 12, 2023, 7:09pm

I have not used them for a while, but as far as I remember, the text is split wherever the separator is found. However, the resulting chunks are capped at a maximum of chunk_size. The chunk_overlap is the number of characters shared between consecutive chunks; an overlap of 0 means the split occurs exactly at the separator and no characters are repeated. If I remember correctly, each chunk overlaps by that amount with the previous and the next chunk.

Hope it helps.
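To make the role of chunk_size and chunk_overlap for a sentence-based splitter more concrete, here is a minimal sketch in plain Python. It is not LangChain's actual code: the function names `split_sentences` and `merge_sentences` are hypothetical, the regex is a naive stand-in for a real sentence tokenizer such as NLTK's or spaCy's, and the merging rule is an assumption about the general idea (pack whole sentences into a chunk until adding the next one would exceed chunk_size, then carry over trailing sentences that fit within chunk_overlap).

```python
import re


def split_sentences(text):
    # Naive stand-in for a real sentence tokenizer (assumption, for illustration only).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def merge_sentences(sentences, chunk_size, chunk_overlap, separator=" "):
    # Pack whole sentences into chunks of at most chunk_size characters.
    # A single sentence longer than chunk_size is kept as its own oversized chunk.
    chunks = []
    current = []
    for sentence in sentences:
        candidate = separator.join(current + [sentence])
        if current and len(candidate) > chunk_size:
            # The next sentence does not fit: close the current chunk.
            chunks.append(separator.join(current))
            # Keep only trailing sentences that fit within chunk_overlap characters.
            while current and len(separator.join(current)) > chunk_overlap:
                current.pop(0)
            # Drop more carried sentences if the new chunk would still be too long.
            while current and len(separator.join(current + [sentence])) > chunk_size:
                current.pop(0)
        current.append(sentence)
    if current:
        chunks.append(separator.join(current))
    return chunks


text = "One two. Three four. Five six. Seven eight."
sentences = split_sentences(text)

no_overlap = merge_sentences(sentences, chunk_size=25, chunk_overlap=0)
with_overlap = merge_sentences(sentences, chunk_size=25, chunk_overlap=12)

print(no_overlap)    # ['One two. Three four.', 'Five six. Seven eight.']
print(with_overlap)  # ['One two. Three four.', 'Three four. Five six.', 'Five six. Seven eight.']
```

In this sketch the number of sentences per chunk is never set directly; it falls out of how many whole sentences fit under chunk_size, which would also explain chunks that are close to, but always below, the configured size.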

happyday · July 12, 2023, 7:19pm

Thank you @isaac.casm for your reply. I probably wasn’t very clear. I’m having trouble understanding how the chunk size/overlap size is determined for the NLTKTextSplitter and SpacyTextSplitter.

Again, I thank you for your kind reply.

isaac.casm · July 13, 2023, 4:43pm

I totally misunderstood you. If I got it right this time, you want to know what values are good for these two parameters. The values depend a lot on the application and on memory and speed requirements. But you can always tune them with cross-validation over the range of values that is acceptable in terms of memory and speed. I would personally keep the chunk_overlap between 5% and 20% of the chunk_size, but if the chunk_size is really small, a larger overlap may be better.

happyday · July 14, 2023, 11:01am

Thank you. But my question is even more basic than that. When I use NLTKTextSplitter I do not see a way to specify the chunk size. I guess I must look deeper. I thought this might be a challenge for others. The chunks are around 4K. This is rather large for some models, so I wanted chunks of 1K.

isaac.casm · July 15, 2023, 4:52pm

Ok, I see your problem. So, the NLTKTextSplitter inherits from the class TextSplitter which is the one containing these parameters. If you check the code, there should be something like super().__init__(**kwargs) in the constructor of NLTKTextSplitter, which is used to pass the parameters **kwargs to the constructor of TextSplitter. Thus, you can use all the parameters of TextSplitter in NLTKTextSplitter.

In conclusion, you can just use:
obj = NLTKTextSplitter(chunk_size = 4000, chunk_overlap = 200)

Hope this time it is what you were looking for.
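The keyword forwarding described above can be illustrated with a small self-contained sketch. The classes `BaseSplitter` and `SentenceSplitter` are hypothetical stand-ins, not the LangChain classes; only the defaults of 4000 and 200 and the `super().__init__(**kwargs)` pattern come from this thread.

```python
class BaseSplitter:
    # Hypothetical base class that owns the size parameters.
    def __init__(self, chunk_size=4000, chunk_overlap=200):
        if chunk_overlap > chunk_size:
            raise ValueError("chunk_overlap must not exceed chunk_size")
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap


class SentenceSplitter(BaseSplitter):
    # Hypothetical subclass: it declares only its own argument and
    # forwards every other keyword argument to the base class.
    def __init__(self, separator="\n\n", **kwargs):
        super().__init__(**kwargs)
        self.separator = separator


default = SentenceSplitter()
small = SentenceSplitter(chunk_size=1000, chunk_overlap=100)

print(default.chunk_size, default.chunk_overlap)  # 4000 200
print(small.chunk_size, small.chunk_overlap)      # 1000 100
```

Because the subclass never lists chunk_size or chunk_overlap in its own signature, they are easy to miss when reading it, yet they are still accepted and applied through the base class. For chunks of about 1K, the same pattern suggests passing chunk_size=1000 instead of the default.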

happyday · July 20, 2023, 12:38pm

@isaac.casm Thank you. You are absolutely right. It was right there in my face. That was quite a stupid mistake on my part. Thank you very much for your kindness. I hope you find many things to smile about.
