Chunking ungraded lab C1M3_Ungraded_Lab

In the section on chunking with overlap, the start and end of each interval/chunk are constructed with chunk_words = text_words[max(i - overlap_int, 0): i + chunk_size]

I’m wondering if the offset for the end of the interval shouldn’t also contain - overlap_int, as in i + chunk_size - overlap_int? (The 1st chunk differs from the rest.)


Please explain this (with an example):

Well, the first interval will be identical to the case without overlap between chunks. Say you have a fixed length of 100 words/tokens and 10% overlap. The 1st interval starts at 0 (because it can’t start at a negative word index) and ends at 100 (just as in the case without overlap), because we are in the fixed-size-chunk case, while the 2nd chunk starts at 90.
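Here is that example made concrete with the notebook’s slicing expression (the 250-word toy document and the print statement are my own illustration; only the slice comes from the lab):

```python
text_words = list(range(250))  # toy stand-in for a 250-word document
chunk_size = 100   # fixed chunk length
overlap_int = 10   # 10% overlap

for i in range(0, len(text_words), chunk_size):
    chunk_words = text_words[max(i - overlap_int, 0): i + chunk_size]
    print(f"start={chunk_words[0]:>3} end={chunk_words[-1]:>3} len={len(chunk_words)}")

# Prints:
# start=  0 end= 99 len=100   <- 1st chunk: the start can't go negative
# start= 90 end=199 len=110   <- 2nd chunk starts at word 90
# start=190 end=249 len=60
```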

Sorry, but your question is still unclear to me. Since I’m not a mentor for this specific course, I don’t have access to your lab either, so I’ve upvoted your question.
That said, you’re welcome to click my name and message the notebook as an attachment while waiting for a course mentor to respond.

My question is: is there a typo in the code chunk_words = text_words[max(i - overlap_int, 0): i + chunk_size]?

Adding @gent.spah to see if he can help or add a course mentor with access to the entire notebook.

The correctness of the expression you’ve highlighted depends on how the variable i is incremented. If the 2nd chunk starts at the 90th word, the code is correct.


This is the section of code from the notebook:

# Iterate over text in steps of chunk_size to create chunks
for i in range(0, len(text_words), chunk_size):
    # Determine the start and end indices for the current chunk,
    # taking into account the overlap with the previous chunk
    chunk_words = text_words[max(i - overlap_int, 0): i + chunk_size]
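To see what this slice actually produces, here is the same loop run on toy inputs of my own (only the loop and the slice come from the notebook; the data and sizes are made up):

```python
text_words = list(range(30))  # toy stand-in for a 30-word tokenized document
chunk_size = 10
overlap_int = 2

# Iterate over text in steps of chunk_size to create chunks
for i in range(0, len(text_words), chunk_size):
    # Each chunk reaches back overlap_int words into the previous one,
    # except the first, which max() clamps to index 0
    chunk_words = text_words[max(i - overlap_int, 0): i + chunk_size]
    print(len(chunk_words), chunk_words)

# Prints:
# 10 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# 12 [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
# 12 [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
```

So only the first chunk has exactly chunk_size words; every later chunk carries overlap_int extra words from its predecessor.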

Please try this:

text_words = list(range(100))
overlap_int = 2
chunk_size = 10

start_index = 0
end_index = 0
while end_index < len(text_words):
    end_index = start_index + chunk_size
    text_chunk = text_words[start_index:end_index]
    start_index = end_index - overlap_int
    print(len(text_chunk), text_chunk)

Output:

10 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
10 [8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
10 [16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
10 [24, 25, 26, 27, 28, 29, 30, 31, 32, 33]
10 [32, 33, 34, 35, 36, 37, 38, 39, 40, 41]
10 [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
10 [48, 49, 50, 51, 52, 53, 54, 55, 56, 57]
10 [56, 57, 58, 59, 60, 61, 62, 63, 64, 65]
10 [64, 65, 66, 67, 68, 69, 70, 71, 72, 73]
10 [72, 73, 74, 75, 76, 77, 78, 79, 80, 81]
10 [80, 81, 82, 83, 84, 85, 86, 87, 88, 89]
10 [88, 89, 90, 91, 92, 93, 94, 95, 96, 97]
4 [96, 97, 98, 99]

Sorry, but I haven’t gone through this course either!
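For comparison, here is the notebook’s own expression run on the same toy inputs as the while-loop demo above (this side-by-side is mine, not from the lab):

```python
text_words = list(range(100))
overlap_int = 2
chunk_size = 10

for i in range(0, len(text_words), chunk_size):
    # The stride stays at chunk_size; each chunk after the first
    # reaches back overlap_int words instead
    chunk_words = text_words[max(i - overlap_int, 0): i + chunk_size]
    print(len(chunk_words), chunk_words[0], chunk_words[-1])
```

The while-loop version keeps every chunk at chunk_size words and shortens the stride to chunk_size - overlap_int; the notebook version keeps the stride at chunk_size and lets every chunk after the first grow to chunk_size + overlap_int words. Both cover every word and share overlap_int words between neighbouring chunks, so the difference looks like a design choice rather than a typo.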


Adding @lucas.coutinho

Hey all!

Nice observation!

The first chunk is different because it starts at position 0, so there’s no previous chunk to overlap with. When i = 0, the max(i - overlap_int, 0) ensures we start at the beginning of the document rather than trying to access negative indices, so only subsequent chunks actually grab those extra overlapping words from what came before.

This isn’t a problem though, since you still get all the content from the start at full chunk size, and the overlap only matters between subsequent chunks where you need to maintain context across transitions.
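A quick sanity check of this, with toy inputs of my own (only the slicing expression comes from the notebook):

```python
text_words = list(range(100))  # toy stand-in for a tokenized document
chunk_size = 10
overlap_int = 2

chunks = [text_words[max(i - overlap_int, 0): i + chunk_size]
          for i in range(0, len(text_words), chunk_size)]

# No content is lost: every word appears in at least one chunk
assert sorted(set(w for c in chunks for w in c)) == text_words
# The first chunk starts at 0 with no leading overlap
assert chunks[0] == text_words[:chunk_size]
# Each later chunk repeats the last overlap_int words of its predecessor
assert all(chunks[k][:overlap_int] == chunks[k - 1][-overlap_int:]
           for k in range(1, len(chunks)))
print("all checks passed")
```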

Best,
Lucas
