Langchain: text splitter behavior

GreenEye · July 6, 2023, 9:48pm

I don’t understand the behavior of LangChain’s RecursiveCharacterTextSplitter. Here is my code and its output.

from langchain.text_splitter import RecursiveCharacterTextSplitter
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=10,
    chunk_overlap=0,
#     separators=["\n"]#, "\n", " ", ""]
)
test = """a\nbcefg\nhij\nk"""
print(len(test))
tmp = r_splitter.split_text(test)
print(tmp)

Output

13
['a\nbcefg', 'hij\nk']

As you can see, it outputs chunks of size 7 and 5 and splits on only one of the newline characters. I was expecting the output to be ['a', 'bcefg', 'hij', 'k'].

TMosh · July 6, 2023, 10:06pm

I have moved your thread out of the “General Discussions” area, and put it in what I think is the correct forum area.

GreenEye · July 7, 2023, 3:15am

Thanks. Ideally it should be under the short course “LangChain: Chat with Your Data”, but I cannot find that section here.

GreenEye · July 7, 2023, 1:59pm

I asked this question on Stack Overflow and got an answer:
https://stackoverflow.com/questions/76633711/langchain-text-splitter-behavior
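
For anyone who lands here, this is a rough sketch of why the output looks the way it does, as far as I understand it. The splitter first cuts the text on a separator and then merges adjacent pieces back together for as long as the merged chunk still fits in chunk_size. The helper below, merge_pieces, is a hypothetical name of my own and is not LangChain code. It only imitates that merge step in plain Python, assuming "\n" is the separator that ends up being used.

```python
def merge_pieces(pieces, separator, chunk_size):
    """Greedily join adjacent pieces while the joined text fits in chunk_size.

    Simplified illustration only; not the LangChain implementation.
    """
    chunks = []
    current = []      # pieces collected for the chunk being built
    current_len = 0   # length of separator.join(current)
    for piece in pieces:
        # Cost of appending this piece: its length plus one separator
        # if the current chunk already holds something.
        extra = len(piece) + (len(separator) if current else 0)
        if current and current_len + extra > chunk_size:
            # The piece does not fit, so close the current chunk and start a new one.
            chunks.append(separator.join(current))
            current, current_len = [], 0
            extra = len(piece)
        current.append(piece)
        current_len += extra
    if current:
        chunks.append(separator.join(current))
    return chunks


test = "a\nbcefg\nhij\nk"
pieces = test.split("\n")             # ['a', 'bcefg', 'hij', 'k']
print(merge_pieces(pieces, "\n", 10)) # ['a\nbcefg', 'hij\nk']
```

Walking through it: 'a' plus 'bcefg' joined by a newline is 7 characters, which fits in 10. Adding 'hij' would make 11, so a new chunk starts there, and 'hij' plus 'k' gives 5. That matches the sizes 7 and 5 in my output. If all you want is one piece per line, the plain test.split("\n") above already returns ['a', 'bcefg', 'hij', 'k'].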
