Hi all! If I'm not mistaken, this function from the Chunking lab never splits long chunks: its docstring says larger chunks can be split "at the middle or specific markers", but the code contains no logic for that. It only merges chunks that are too small.
def mixed_chunking(source_text):
    """
    Splits the given source_text into chunks using a mix of fixed-size and variable-size chunking.
    It first splits the text by Asciidoc markers and then processes the chunks to ensure they are
    of appropriate size. Smaller chunks are merged with the next chunk, and larger chunks can be
    further split at the middle or specific markers within the chunk.
    Args:
    - source_text (str): The text to be chunked.
    Returns:
    - list: A list of text chunks.
    """
    # Split the text by Asciidoc marker
    chunks = source_text.split("\n==")
    # Chunking logic
    new_chunks = []
    chunk_buffer = ""
    min_length = 25
    for chunk in chunks:
        new_buffer = chunk_buffer + chunk  # Create new buffer
        new_buffer_words = new_buffer.split(" ")  # Split into words
        if len(new_buffer_words) < min_length:  # Check whether buffer length is too small
            chunk_buffer = new_buffer  # Carry over to the next chunk
        else:
            new_chunks.append(new_buffer)  # Add to chunks
            chunk_buffer = ""
    if len(chunk_buffer) > 0:
        new_chunks.append(chunk_buffer)  # Add last chunk, if necessary
    return new_chunks
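
For reference, here is a minimal sketch of what the missing "split at the middle" step might look like. It is my own guess, not code from the lab: the helper name `split_long_chunk` and the `max_length` threshold of 100 words are assumptions, and it counts words by splitting on single spaces, the same way the function above does.

```python
def split_long_chunk(chunk, max_length=100):
    """Recursively halve a chunk at its middle word until each piece
    has at most max_length words (hypothetical helper, not from the lab)."""
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    words = chunk.split(" ")  # Same word counting as mixed_chunking
    if len(words) <= max_length:
        return [chunk]  # Short enough, keep as is
    middle = len(words) // 2  # Index of the middle word
    left = " ".join(words[:middle])
    right = " ".join(words[middle:])
    # Each half may still be too long, so split again if needed
    return split_long_chunk(left, max_length) + split_long_chunk(right, max_length)
```

One way to use it would be to run every entry of `new_chunks` through this helper just before `return new_chunks`, so that merged chunks that grew too large get cut down again. Splitting at "specific markers" inside a chunk would need extra logic that this sketch does not attempt.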
