Regarding 'reciprocal rank fusion' in the C1M2 Assignment

list1 = semantic_search_retrieve(query, top_k)
list2 = bm25_retrieve(query, top_k)
# Combine the results using reciprocal rank fusion.
top_k_indices = reciprocal_rank_fusion(list1, list2, top_k)

  1. If we want to retrieve the top-k relevant documents, shouldn't we retrieve more than top-k candidates for list1 and list2 before applying reciprocal_rank_fusion? Otherwise we might miss good candidates. Suppose top-k is 3 and k = 60 in the RRF formula. Document 0 ranks 1st in semantic search but 4th in keyword search, so the truncated list2 won't include it and it scores only 1/61 ≈ 0.0164, while document 1, ranking 3rd in both lists, scores 1/63 + 1/63 ≈ 0.0317 and wins. With full (untruncated) lists, document 0 would score 1/61 + 1/64 ≈ 0.0320 and beat document 1. When I apply a CrossEncoder to re-rank, I always retrieve 3-5 times top-k before re-ranking; I think the same principle should apply to both cases.
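To make the arithmetic concrete, here is a minimal RRF sketch (k = 60 as the usual constant; `rrf_scores` is my own helper name, not the assignment's function) showing how the top-k cutoff flips the winner:

```python
def rrf_scores(ranked_lists, k=60):
    """Plain RRF: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for lst in ranked_lists:
        for rank, doc in enumerate(lst, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return scores

semantic = [0, "a", 1]           # doc 0 ranks 1st, doc 1 ranks 3rd
keyword_full = ["b", "c", 1, 0]  # doc 1 ranks 3rd, doc 0 ranks 4th

truncated = rrf_scores([semantic, keyword_full[:3]])  # top_k = 3 cutoff
full = rrf_scores([semantic, keyword_full])           # retrieved deeper

# Truncated: doc 1 (1/63 + 1/63 ≈ 0.0317) beats doc 0 (only 1/61 ≈ 0.0164).
# Full:      doc 0 (1/61 + 1/64 ≈ 0.0320) edges out doc 1.
```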

  2. RRF does not take an alpha or beta weight (semantic search vs. keyword search) into consideration. In other words, it always gives equal weight to semantic and keyword search. Is that a good approach, and is it widely adopted in industry?
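For what it's worth, nothing prevents weighting the RRF terms themselves. A hypothetical weighted variant (my own sketch, not part of the assignment; `weighted_rrf` is an assumed name) might look like:

```python
def weighted_rrf(list_sem, list_kw, alpha=0.5, k=60):
    """Weighted RRF sketch: alpha scales the semantic list's contribution,
    (1 - alpha) the keyword list's; alpha = 0.5 recovers plain RRF up to scale."""
    scores = {}
    for weight, lst in ((alpha, list_sem), (1 - alpha, list_kw)):
        for rank, doc in enumerate(lst, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

With alpha = 1.0 the fused order collapses to the semantic order, and with alpha = 0.0 to the keyword order, so the same knob from weighted score fusion carries over.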

    Here is what I did in my Agentic RAG project.

norm_sem_scores = self.search_semantic_matches(query)
norm_bm25_scores = self.search_keyword_matches(query)
if norm_bm25_scores is None:
    return None
# Weighted score fusion over the full normalized score arrays.
combined = self.alpha * norm_sem_scores + (1 - self.alpha) * norm_bm25_scores
# No guardrail here; the re-ranker seems to be a better guardrail.
top_k = np.argsort(combined)[::-1][:n_results]

I keep the full normalized score arrays for both retrievers; it's better to select top-k after fusion than before.
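Here is a self-contained version of the snippet above (the function name and the min-max normalization are my assumptions about the surrounding code; only numpy is needed):

```python
import numpy as np

def hybrid_top_k(sem_scores, bm25_scores, alpha=0.7, n_results=3):
    """Weighted score fusion over the full corpus, selecting top-k AFTER
    combining. Assumes both arrays score every document; min-max puts
    them on a comparable [0, 1] scale before mixing."""
    def min_max(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    combined = alpha * min_max(sem_scores) + (1 - alpha) * min_max(bm25_scores)
    return np.argsort(combined)[::-1][:n_results]
```

Because both retrievers score every document, no candidate can be dropped by an early cutoff, which is exactly the post-select point.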


hi @BlackHippo086

Yes, as far as I understand RRF, that's the correct approach when using a re-ranker, but as a baseline you need to be sure what your RAG architecture is trying to accomplish.

If the goal is precision: scaling a cross-encoder to larger candidate sets (k > 100) can cause incorrect re-ranking, especially when the initial retrieval is already highly effective; and if the initial retrieval is poor and fails to retrieve the relevant documents, the re-ranker won't solve the issue either.

Also, if the re-ranker was trained on general data and is applied to a highly specialized domain, for example specific legal or medical datasets, it can produce lower-quality rankings than the initial retriever.

Tagging the instructor of the course for his suggestion.

@Zain_Hassan what do you have to say!!!

Module 3 might have already answered my 2nd question. Most companies do use hybrid search and customize the alpha weight between semantic and keyword search; Weaviate offers that option for hybrid search. Thanks!
