Mixed chunking function

Hi all! If I’m not mistaken, the mixed_chunking function in the chunking lab never splits long chunks and has no logic for “splitting at the middle or specific markers”, even though its docstring describes that behavior.

def mixed_chunking(source_text):
    """
    Splits the given source_text into chunks using a mix of fixed-size and variable-size chunking.
    It first splits the text by Asciidoc markers and then processes the chunks to ensure they are
    of appropriate size. Smaller chunks are merged with the next chunk, and larger chunks can be
    further split at the middle or specific markers within the chunk.

    Args:
    - source_text (str): The text to be chunked.

    Returns:
    - list: A list of text chunks.
    """

    # Split the text by Asciidoc marker
    chunks = source_text.split("\n==")

    # Chunking logic
    new_chunks = []
    chunk_buffer = ""
    min_length = 25

    for chunk in chunks:
        new_buffer = chunk_buffer + chunk  # Create new buffer
        new_buffer_words = new_buffer.split(" ")  # Split into words
        if len(new_buffer_words) < min_length:  # Check whether buffer length is too small
            chunk_buffer = new_buffer  # Carry over to the next chunk
        else:
            new_chunks.append(new_buffer)  # Add to chunks
            chunk_buffer = ""

    if len(chunk_buffer) > 0:
        new_chunks.append(chunk_buffer)  # Add last chunk, if necessary

    return new_chunks

I’m not a mentor for this course.

You are correct that the function does nothing to process larger chunks. That said, look at the wording of the docstring: it says that larger chunks can be split further, not that this additional step is performed inside the function body:

larger chunks can be
further split at the middle or specific markers within the chunk.
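
For anyone who wants that extra step, here is a minimal sketch of how oversized chunks could be split at the middle word after mixed_chunking returns. It is not part of the lab: split_long_chunk, split_oversized, and the max_length of 100 words are hypothetical names and values chosen for illustration only.

```python
def split_long_chunk(chunk, max_length=100):
    """Halve a chunk at its middle word until every piece has at most max_length words.

    Hypothetical helper, not part of the lab code. Words are separated by single
    spaces, matching the word count used in mixed_chunking.
    """
    if max_length < 1:
        raise ValueError("max_length must be at least 1")

    words = chunk.split(" ")
    if len(words) <= max_length:
        return [chunk]  # Already small enough, keep as-is

    middle = len(words) // 2  # Split point at the middle word
    left = " ".join(words[:middle])
    right = " ".join(words[middle:])

    # Each half may still be too long, so split recursively
    return split_long_chunk(left, max_length) + split_long_chunk(right, max_length)


def split_oversized(chunks, max_length=100):
    """Apply split_long_chunk to every chunk in a list (hypothetical helper)."""
    result = []
    for chunk in chunks:
        result.extend(split_long_chunk(chunk, max_length))
    return result
```

Calling split_oversized(mixed_chunking(source_text)) would then cap every chunk at max_length words. Splitting at a specific marker instead of the middle would need a different rule for choosing the split point.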

hi @NinoKereselidze

Just to be clear about your question: when you say this function does not split long chunks, which function in the lab code you shared are you referring to?

Hi! I’m referring to the mixed_chunking function, which is defined in the Module 3 chunking lab.

I will go through the lab code to confirm that and get back to you shortly.

hi @NinoKereselidze

The mixed_chunking function is called in the cell right after the code you shared here, to check the test output regardless of whether a chunk is too small or too large.

Also, going through the later cells, I noticed the same mixed_chunking function being used to implement a mixed chunking strategy based on paragraph markers, with a minimum chunk length of 25, in the def build chunk obj.