I am fairly new to NLP, so forgive me if this is an obvious question. It is not obvious to me how sentence splitters like NLTKTextSplitter and SpacyTextSplitter determine how many sentences to put in a chunk, i.e., what do chunk size and overlap size mean for these two splitters? I see that the text splitters inherit from the TextSplitter class, where the defaults are set to 4000 (chunk size) and 200 (overlap size). But it is not clear to me how the sentence splitters use these. The chunks I get back are all around 4000 (always less), but while it is completely obvious why the chunk size is the way it is for CharacterTextSplitter, how does this work for the sentence text splitters? Thank you.
I have not used them for a while, but I seem to remember that the chunks are obtained using the separator, so the input text is split wherever that character is found. However, chunks are capped at a maximum of chunk_size. The chunk_overlap is the number of characters that overlap between consecutive chunks; an overlap of 0 means that the split occurs exactly at the separator and no characters are shared. If I remember correctly, each chunk overlaps by that amount with the previous and the next chunk.
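To make the idea concrete, here is a rough sketch of how whole sentences could be packed into chunks under a size cap with some overlap. This is a simplified illustration, not the library's actual code: split_sentences is a naive regex stand-in for the NLTK or spaCy tokenizer, and merge_sentences is a hypothetical helper written only for this example.

```python
import re


def split_sentences(text):
    # Naive stand-in for a real sentence tokenizer (NLTK or spaCy would go here).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def merge_sentences(sentences, chunk_size, chunk_overlap, separator=" "):
    """Greedily pack whole sentences into chunks of at most chunk_size characters.

    Trailing sentences totalling at most chunk_overlap characters are carried
    over to the start of the next chunk. A single sentence longer than
    chunk_size is kept whole and becomes an oversized chunk of its own.
    """
    chunks, current = [], []
    for sentence in sentences:
        candidate = separator.join(current + [sentence])
        if current and len(candidate) > chunk_size:
            # The chunk is full: emit it before adding the new sentence.
            chunks.append(separator.join(current))
            # Drop sentences from the front until the remainder fits in the
            # overlap budget and still leaves room for the new sentence.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [sentence])) > chunk_size
            ):
                current.pop(0)
        current.append(sentence)
    if current:
        chunks.append(separator.join(current))
    return chunks


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
chunks = merge_sentences(split_sentences(text), chunk_size=35, chunk_overlap=16)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

In this sketch a sentence is never cut in half, so chunks tend to come out somewhat below chunk_size rather than exactly at it, and a sentence is repeated in the next chunk only if it fits entirely within chunk_overlap.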
Hope it helps.
Thank you @isaac.casm for your reply. I probably wasn’t very clear. I’m having trouble understanding how the chunk size and overlap size are determined for the NLTKTextSplitter and SpacyTextSplitter.
Again, I thank you for your kind reply.
I totally misunderstood you. If I got it right this time, you want to know what values are good for these two variables. The values depend a lot on the application and on memory and speed requirements. You can always tune them using cross-validation within the range of values that is acceptable in terms of memory and speed. I would personally keep the chunk_overlap between 5% and 20% of the chunk_size, but if the chunk_size is really small, a larger overlap may be better.
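As a quick worked example of that rule of thumb, the snippet below computes the 5% to 20% overlap range for a given chunk size. The function name overlap_range and its percentage parameters are made up for illustration; the percentages themselves are only the personal guideline mentioned above.

```python
def overlap_range(chunk_size, low_pct=5, high_pct=20):
    """Return the (min, max) overlap for a chunk size, using integer percentages."""
    return chunk_size * low_pct // 100, chunk_size * high_pct // 100


for size in (1000, 4000):
    low, high = overlap_range(size)
    print(f"chunk_size={size}: chunk_overlap between {low} and {high}")
```

By this measure the default pair of 4000 and 200 sits at the low end of the range, since 200 is 5% of 4000.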
Thank you. But my question is even more basic than that. When I use NLTKTextSplitter I do not see a way to specify the chunk size. I guess I must look deeper. I thought this might be a challenge for others. The chunks are around 4K. This is rather large for some models so I wanted chunks of 1K.
OK, I see your problem. The NLTKTextSplitter inherits from the class TextSplitter, which is the one containing these parameters. If you check the code, there should be something like super().__init__(**kwargs) in the constructor of NLTKTextSplitter, which passes the keyword arguments in **kwargs on to the constructor of TextSplitter. Thus, you can use all the parameters of TextSplitter in NLTKTextSplitter.
In conclusion, you can just use:
obj = NLTKTextSplitter(chunk_size=4000, chunk_overlap=200)
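If the forwarding is hard to picture, here is a minimal, self-contained sketch of the pattern using stand-in classes. BaseSplitter and SentenceSplitter are hypothetical names that only mimic the relationship between TextSplitter and NLTKTextSplitter described above; they are not the library's real classes, and the real constructors may take other arguments.

```python
class BaseSplitter:
    """Stand-in for TextSplitter: owns the size parameters and their defaults."""

    def __init__(self, chunk_size=4000, chunk_overlap=200):
        if chunk_overlap > chunk_size:
            raise ValueError("chunk_overlap must not exceed chunk_size")
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap


class SentenceSplitter(BaseSplitter):
    """Stand-in for NLTKTextSplitter: adds its own option and forwards the rest."""

    def __init__(self, separator="\n\n", **kwargs):
        # Anything the child does not consume itself goes to the parent constructor.
        super().__init__(**kwargs)
        self.separator = separator


default = SentenceSplitter()
small = SentenceSplitter(chunk_size=1000, chunk_overlap=100)
print(default.chunk_size, default.chunk_overlap)
print(small.chunk_size, small.chunk_overlap)
```

The child class never names chunk_size or chunk_overlap in its own signature, which is why they are easy to miss, yet both are accepted and end up on the object. The same call shape with chunk_size=1000 is what would give chunks of about 1K.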
I hope this is what you were looking for this time.
@isaac.casm Thank you. You are absolutely right. It was right there in my face. That was quite a stupid mistake on my part. Thank you very much for your kindness. I hope you find many things to smile about.