I found that the behavior of RecursiveCharacterTextSplitter
in the video (LangChain v0.0.213) is not consistent with what I get from the version I have installed, LangChain v0.0.297.
In the video:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""
from langchain.text_splitter import RecursiveCharacterTextSplitter

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", "\. ", " ", ""]
)
r_splitter.split_text(some_text)
results in:
["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related",
'. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
'Paragraphs are often delimited with a carriage return or two carriage returns',
'. Carriage returns are the "backslash n" you see embedded in this string',
'. Sentences have a period at the end, but also, have a space.and words are separated by space.']
However, what I get in LangChain v0.0.297 is:
["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
I have figured out that in v0.0.297 I have to change the separators
parameter to separators=["\n\n", "\n", ". ", " ", ""]
(a plain ". " instead of the escaped "\. ") to obtain the same result as in the video.
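My guess at why this works, sketched with only Python's re module rather than LangChain itself: if the newer version treats each separator as a literal string (escaping it before matching) instead of as a regular expression, then "\. " would be searched for as backslash, period, space, which never occurs in the text. That is an assumption on my part about the library's internals, and the sample text and variable names below are just for illustration.

```python
import re

text = "One idea. Another idea. Last"

# The same three characters as "\. " in the snippets above: backslash, period, space.
video_sep = "\\. "
plain_sep = ". "

# Interpreted as a regular expression: matches a period followed by a space.
as_regex = re.split(video_sep, text)

# Interpreted literally (escaped first): looks for backslash + period + space,
# which the text does not contain, so nothing is split.
as_literal = re.split(re.escape(video_sep), text)

# A plain ". " interpreted literally splits at the same places as the regex did.
plain_literal = re.split(re.escape(plain_sep), text)

print(as_regex)
print(as_literal)
print(plain_literal)
```

If that assumption holds, it would also explain why the v0.0.297 output above falls back to splitting on spaces.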
However, with the following code, which uses a lookbehind separator together with keep_separator=True:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", "(?<=\. )", " ", ""],
    keep_separator=True
)
r_splitter.split_text(some_text)
I am not able to obtain the expected result:
["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related.",
'For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
'Paragraphs are often delimited with a carriage return or two carriage returns.',
'Carriage returns are the "backslash n" you see embedded in this string.',
'Sentences have a period at the end, but also, have a space.and words are separated by space.']
How can I get the above result with the newest version of LangChain?
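For reference, here is a minimal sketch of what I expect the lookbehind separator to do, again using only Python's re module (3.7 or later, where re.split accepts zero-width matches) and not LangChain. The sample text is made up for illustration. I suspect the newer splitter needs to be told that the separators are regular expressions, possibly through an is_separator_regex=True argument, but I have not verified that against v0.0.297.

```python
import re

text = "One idea. Another idea. Last"

# Zero-width lookbehind: split right after a period followed by a space,
# without consuming either character.
lookbehind_sep = r"(?<=\. )"

# As a regular expression, each piece keeps its trailing ". ".
pieces = re.split(lookbehind_sep, text)

# Stripping whitespace gives sentences that end with the period.
stripped = [piece.strip() for piece in pieces]

# If the pattern is escaped and matched literally, it never occurs in the text.
as_literal = re.split(re.escape(lookbehind_sep), text)

print(pieces)
print(stripped)
print(as_literal)
```

The stripped pieces have the same shape as the expected result above: every chunk keeps its closing period and no chunk starts with one.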