Hi,
This exercise is confusing.
Hint 2 says:
“Make sure the corpus is indexed. This can be done by preparing the retriever with the document data before performing retrieval. Use the .index method of BM25_RETRIEVER.”
However, the indexing was already done two cells earlier (in the example) and is done again one cell before the graded cell.
I ran the cell before the graded cell and was able to print TOKENIZED_DATA.
The tokenized query printed in the graded cell is:
Tokenized(
"ids": [
0: [0, 1, 2, 3, 4]
],
"vocab": [
'about': 3
'gdp': 4
'news': 2
'recent': 1
'what': 0
],
)
Then, when I call BM25_RETRIEVER.retrieve() with the tokenized query and top_k as positional arguments, I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[14], line 2
1 # Output is a list of indices
----> 2 bm25_retrieve("What are the recent news about GDP?")
Cell In[13], line 28, in bm25_retrieve(query, top_k)
24 print(TOKENIZED_DATA)
26 # Use the 'BM25_RETRIEVER' to retrieve documents and their scores based on the tokenized query
27 # Retrieve the top 'k' documents
---> 28 results, scores = BM25_RETRIEVER.retrieve(tokenized_query, top_k)
29 print("B")
31 # Extract the first element from 'results' to get the list of retrieved documents
File /usr/local/lib/python3.11/site-packages/bm25s/__init__.py:866, in BM25.retrieve(self, query_tokens, corpus, k, sorted, return_as, show_progress, leave_progress, n_threads, chunksize, backend_selection, weight_mask)
864 else:
865 index_flat = indices.flatten().tolist()
--> 866 results = [corpus[i] for i in index_flat]
867 retrieved_docs = np.array(results).reshape(indices.shape)
869 if return_as == "tuple":
File /usr/local/lib/python3.11/site-packages/bm25s/__init__.py:866, in <listcomp>(.0)
864 else:
865 index_flat = indices.flatten().tolist()
--> 866 results = [corpus[i] for i in index_flat]
867 retrieved_docs = np.array(results).reshape(indices.shape)
869 if return_as == "tuple":
TypeError: 'int' object is not subscriptable
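One thing I notice from the signature printed in the traceback: the second positional parameter of retrieve is corpus, and k only comes third. So my guess is that passing top_k positionally puts the integer into corpus, which would explain why corpus[i] fails on an int. Below is a minimal sketch of that guess. The retrieve function here is a hypothetical stand-in that only copies the parameter order from the traceback; it is not the real bm25s implementation.

```python
# Hypothetical stand-in: mirrors only the parameter order shown in the
# traceback (query_tokens, corpus, k, ...). This is NOT the bm25s code.
def retrieve(query_tokens, corpus=None, k=10):
    indices = list(range(k))  # pretend these are the top-k document indices
    if corpus is None:
        return indices
    # Same pattern as line 866 in the traceback
    return [corpus[i] for i in indices]


tokenized_query = [[0, 1, 2, 3, 4]]
top_k = 3

# Positional call: top_k lands in the 'corpus' slot, so corpus[i] indexes an int
try:
    retrieve(tokenized_query, top_k)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Keyword call: top_k reaches 'k' as intended
indices = retrieve(tokenized_query, k=top_k)
print(indices)
```

If this guess is right, calling BM25_RETRIEVER.retrieve(tokenized_query, k=top_k) should avoid the error, but I have not confirmed that against the library itself.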