L2 RecursiveCharacterTextSplitter behavior changed

McCheng · September 24, 2023, 10:04am

I found that the behavior of RecursiveCharacterTextSplitter in LangChain v0.0.213, as shown in the video, is not consistent with what I get in LangChain v0.0.297.

In the video:

from langchain.text_splitter import RecursiveCharacterTextSplitter

some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", "\. ", " ", ""]
)
r_splitter.split_text(some_text)

results in:

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related",
 '. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns',
 '. Carriage returns are the "backslash n" you see embedded in this string',
 '. Sentences have a period at the end, but also, have a space.and words are separated by space.']

However, what I get in LangChain v0.0.297 is:

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

I have figured out that I have to change the separators parameter to separators=["\n\n", "\n", ". ", " ", ""] in the latest version to obtain the same result.
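A possible explanation, offered as a guess rather than something confirmed in this thread: newer releases appear to treat each separator as a literal string and escape it before splitting, unless a flag such as is_separator_regex=True is passed (please check this against your installed version). The sketch below uses only Python's re module, not LangChain, and the variable names are made up for illustration. It shows why "\. " would stop matching once it is escaped.

```python
import re

sample = "Paragraphs form a document. Sentences have a period at the end."
separator = "\\. "  # the same three characters as "\. " in the snippet above

# Interpreted as a regex: matches a literal period followed by a space.
split_as_regex = re.split(separator, sample)

# Escaped first (the assumed newer behavior): matches only the characters
# backslash + period + space, which never occur in the sample.
split_as_literal = re.split(re.escape(separator), sample)

print(split_as_regex)    # two pieces, split at ". "
print(split_as_literal)  # one piece, the sample unchanged
```

If that assumption holds, the plain ". " separator works because it is matched character for character, with no regex interpretation involved.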

However, in the following code:

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", "(?<=\. )", " ", ""],
    keep_separator=True
)
r_splitter.split_text(some_text)

I am not able to obtain the expected result:

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related.",
 'For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns.',
 'Carriage returns are the "backslash n" you see embedded in this string.',
 'Sentences have a period at the end, but also, have a space.and words are separated by space.']

How can I get the above result with the newest version of LangChain?
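For reference, here is a minimal sketch of what the lookbehind separator does when it really is treated as a regex. It uses plain re.split rather than LangChain, and sample, pieces and sentences are illustrative names only. Whether the splitter in v0.0.297 applies the pattern this way is an assumption I have not verified.

```python
import re

sample = (
    "Carriage returns are embedded in this string. "
    "Sentences have a period at the end. "
    "Words are separated by space."
)

# Zero-width lookbehind: split at the position right after each ". ",
# so no characters are consumed and the period stays with its sentence.
# Splitting on empty matches needs Python 3.7 or later.
pieces = re.split(r"(?<=\. )", sample)

# Trim the trailing space left on each piece.
sentences = [piece.strip() for piece in pieces if piece.strip()]
print(sentences)
```

Each sentence keeps its own period, which is the shape of the expected result above. If the pattern is escaped and matched as literal text instead, it never matches and the splitter falls through to the " " separator.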

yerrochdi · September 29, 2023, 7:46am

Hi, I ran into the same issue. I changed my separators to separators=["\n\n", "\n", ". ", " ", ""]. The period is still misplaced, but the result is better than before.

McCheng · September 29, 2023, 5:45pm

Using separators=["\n\n", "\n", ". ", " ", ""] gives us

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related",
 '. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns',
 '. Carriage returns are the "backslash n" you see embedded in this string',
 '. Sentences have a period at the end, but also, have a space.and words are separated by space.']

but not the one that appears in the course video:

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related.",
 'For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns.',
 'Carriage returns are the "backslash n" you see embedded in this string.',
 'Sentences have a period at the end, but also, have a space.and words are separated by space.']

The latter is clearly more desirable.
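The difference between the two outputs seems to come down to which side the kept separator is attached to. Below is a small sketch with plain re.split, not LangChain's actual code; the names parts, leading and trailing are hypothetical. My understanding is that the splitter glues a kept separator onto the piece that follows it, which would explain the leading periods, but I have not checked the source.

```python
import re

sample = "Ideas are related. Similar ideas are in paragraphs. Paragraphs form a document."

# A capture group makes re.split return the separators as separate items.
parts = re.split(r"(\. )", sample)

# Separator glued to the FOLLOWING piece: periods lead the next chunk.
leading = [parts[0]] + [parts[i] + parts[i + 1] for i in range(1, len(parts) - 1, 2)]

# Separator glued to the PRECEDING piece: periods end their own sentence.
trailing = [parts[i] + parts[i + 1] for i in range(0, len(parts) - 1, 2)] + [parts[-1]]
trailing = [piece.strip() for piece in trailing]

print(leading)
print(trailing)
```

Here leading reproduces the '. Similar ideas ...' pattern, while trailing matches the style shown in the course video.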
