list1 = semantic_search_retrieve(query, top_k)
list2 = bm25_retrieve(query, top_k)
# Combine the results using reciprocal rank fusion.
top_k_indices = reciprocal_rank_fusion(list1, list2, top_k)
If we want to retrieve the top-k relevant documents, should we retrieve more than top-k documents for list1 and list2 before applying reciprocal_rank_fusion? Otherwise, we might miss good candidates. For example, with top_k = 3 and the standard RRF constant k = 60: suppose index=0 ranks 1st in semantic search and 4th in keyword search (so list2 won't include index 0), while index=1 ranks 3rd in both. Then index=1 (1/63 + 1/63 ≈ 0.0317) will beat index=0 (only 1/61 ≈ 0.0164, since its keyword contribution is lost), even though with full lists index=0 (1/61 + 1/64 ≈ 0.0320) would narrowly win. When I apply a CrossEncoder to re-rank, I always retrieve 3-5 times top-k before re-ranking. I think the same principle should apply in both cases.
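The truncation effect above can be checked with a minimal RRF sketch (the function name and doc ids are illustrative, not from the original snippet; the standard RRF constant k=60 is assumed):

```python
def rrf_scores(ranked_lists, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

# Full lists: doc 0 ranks 1st semantically and 4th by keyword;
# doc 1 ranks 3rd in both.
semantic = [0, 2, 1, 3]
keyword = [2, 3, 1, 0]

full = rrf_scores([semantic, keyword])
truncated = rrf_scores([semantic[:3], keyword[:3]])  # pre-select top_k = 3

# With full lists doc 0 narrowly wins (1/61 + 1/64 > 2/63);
# after truncation doc 0 loses its keyword contribution and doc 1 wins.
```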
RRF does not take alpha or beta (the weight on semantic search vs. keyword search) into consideration. In other words, it always gives the same weight to semantic search and keyword search. Is that a good approach, and is it widely adopted in industry?
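Standard RRF is unweighted, but nothing prevents weighting each list's contribution. A minimal sketch of a weighted variant (the function name and weights are my own illustration, not a standard API):

```python
def weighted_rrf(ranked_lists, weights, k=60):
    """RRF where list i contributes weights[i] / (k + rank) per document."""
    scores = {}
    for ranked, w in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    return scores

# Alpha-style weighting: 0.7 on the semantic list, 0.3 on the keyword list.
scores = weighted_rrf([[0, 2, 1], [2, 3, 1]], weights=[0.7, 0.3])
```

Setting both weights to 0.5 recovers (a scaled version of) plain RRF, so this generalizes rather than replaces it.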
Here is what I did in my Agentic RAG project.
norm_sem_scores = self.search_semantic_matches(query)
norm_bm25_scores = self.search_keyword_matches(query)
if norm_bm25_scores is None:
    return None
# Weighted linear combination of the normalized score vectors.
combined = self.alpha * norm_sem_scores + (1 - self.alpha) * norm_bm25_scores
# I did not set up a guardrail here; the re-ranker seems to be a better guardrail.
top_k_indices = np.argsort(combined)[::-1][:n_results]
I keep the full normalized score vectors for both retrievers; it's better to post-select top-k after fusion than to pre-select top-k from each list.
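For the alpha-weighted combination to be meaningful, both score vectors need to be on the same scale. A minimal self-contained sketch of the approach, assuming min-max normalization behind the `norm_` prefixes (the helper names `min_max_normalize` and `fuse` are hypothetical, not from my project code):

```python
import numpy as np

def min_max_normalize(scores):
    """Scale raw scores into [0, 1]; a constant vector maps to all zeros."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0.0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span

def fuse(sem_scores, bm25_scores, alpha=0.7, n_results=3):
    """Weighted fusion over full normalized score vectors, then post-select top-k indices."""
    combined = (alpha * min_max_normalize(sem_scores)
                + (1 - alpha) * min_max_normalize(bm25_scores))
    # Sort descending and keep the n_results best document indices.
    return np.argsort(combined)[::-1][:n_results]
```

Because fusion runs over the full vectors, a document that is mediocre in one retriever but strong in the other still competes for the final top-k.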