RAG Module 2 bm25_retrieve graded exercise error

Hi,

This exercise is confusing.

Hint 2 says:

Make sure the corpus is indexed. This can be done by preparing the retriever with the document data before performing retrieval. Use the .index method of BM25_RETRIEVER.

However, the indexing was already done two cells earlier (in the example), and again one cell before the graded cell.

I ran the cell before the graded cell and was able to print TOKENIZED_DATA.

The tokenized query result in the graded cell is

Tokenized(
  "ids": [
    0: [0, 1, 2, 3, 4]
  ],
  "vocab": [
    'about': 3
    'gdp': 4
    'news': 2
    'recent': 1
    'what': 0
  ],
)

Then, when I call BM25_RETRIEVER.retrieve() with the tokenized query and top_k, I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[14], line 2
      1 # Output is a list of indices
----> 2 bm25_retrieve("What are the recent news about GDP?")

Cell In[13], line 28, in bm25_retrieve(query, top_k)
     24 print(TOKENIZED_DATA)
     26 # Use the 'BM25_RETRIEVER' to retrieve documents and their scores based on the tokenized query
     27 # Retrieve the top 'k' documents
---> 28 results, scores = BM25_RETRIEVER.retrieve(tokenized_query, top_k)
     29 print("B")
     31 # Extract the first element from 'results' to get the list of retrieved documents

File /usr/local/lib/python3.11/site-packages/bm25s/__init__.py:866, in BM25.retrieve(self, query_tokens, corpus, k, sorted, return_as, show_progress, leave_progress, n_threads, chunksize, backend_selection, weight_mask)
    864     else:
    865         index_flat = indices.flatten().tolist()
--> 866         results = [corpus[i] for i in index_flat]
    867         retrieved_docs = np.array(results).reshape(indices.shape)
    869 if return_as == "tuple":

File /usr/local/lib/python3.11/site-packages/bm25s/__init__.py:866, in <listcomp>(.0)
    864     else:
    865         index_flat = indices.flatten().tolist()
--> 866         results = [corpus[i] for i in index_flat]
    867         retrieved_docs = np.array(results).reshape(indices.shape)
    869 if return_as == "tuple":

TypeError: 'int' object is not subscriptable

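I think the main issue is how you call the retrieve method. The signature in your traceback is retrieve(self, query_tokens, corpus, k, ...), so a top_k passed positionally fills the corpus parameter, and corpus[i] then fails on an int. Pass top_k as the keyword argument k instead.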
Check out the documentation here.

Here’s an example given:

import bm25s

# Create your corpus here
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
]

# Tokenize the corpus and index it
corpus_tokens = bm25s.tokenize(corpus)
retriever = bm25s.BM25(corpus=corpus)
retriever.index(corpus_tokens)

# You can now search the corpus with a query
query = "does the fish purr like a cat?"
query_tokens = bm25s.tokenize(query)
docs, scores = retriever.retrieve(query_tokens, k=2)  # check this line: the value is passed as k=
print(f"Best result (score: {scores[0, 0]:.2f}): {docs[0, 0]}")

# Happy with your index? Save it for later...
retriever.save("bm25s_index_animals")

# ...and load it when needed
ret_loaded = bm25s.BM25.load("bm25s_index_animals", load_corpus=True)
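
To see why the positional call raises that TypeError, here is a minimal sketch. fake_retrieve is a hypothetical stand-in that copies only the parameter order from your traceback (query_tokens, corpus, k); it is not the real bm25s code, and its "top-k" is just the first k indices.

```python
# Hypothetical stand-in: mirrors only the parameter order shown in the
# traceback, retrieve(self, query_tokens, corpus, k, ...). Not bm25s itself.
def fake_retrieve(query_tokens, corpus=None, k=10):
    # Pretend the top-k document indices are simply 0..k-1.
    indices = list(range(k))
    if corpus is None:
        return indices
    # Same kind of lookup as the line flagged in the traceback.
    return [corpus[i] for i in indices]


docs = ["doc a", "doc b", "doc c"]

# Keyword call: k is bound as intended.
by_keyword = fake_retrieve(["gdp"], corpus=docs, k=2)

# Positional call: the 2 lands in `corpus`, so corpus[i] indexes an int.
try:
    fake_retrieve(["gdp"], 2)
    positional_error = None
except TypeError as exc:
    positional_error = str(exc)

print(by_keyword)
print(positional_error)
```

In the graded cell, the same reasoning suggests writing BM25_RETRIEVER.retrieve(tokenized_query, k=top_k).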


Also check this post:

Many thanks. I corrected the syntax and was able to complete Module 2.
