How to deal with no results retrieval?

ILoong · August 18, 2023, 7:12am

Hi there, I have watched all the courses on LLM semantic search.
My understanding is that Annoy or cosine similarity always returns the nearest items, even for a small dataset. For example, say I have a dataset of 1k vectors and the query is not close to any of them. A nearest-neighbor search with Annoy or cosine similarity will still return results, and those results will be false positives.
So my question is: what method can we use to return no results, or to detect that the scores are very low? How should we set the threshold for lexical search or for vector search?

Pooriya_Jamie · August 18, 2023, 4:06pm

Dear @ILoong,
When using methods like Annoy or cosine similarity for semantic search, it is true that you may encounter false positive results if the query is not similar to any of the vectors in your dataset. These methods focus on finding the nearest neighbors based on similarity, so they might not be well-suited for scenarios where you want to retrieve “no results” or very low-scoring matches.

To handle such cases and set a threshold for lexical search or vector search, you can consider the following approaches:

Threshold-based filtering: After obtaining the search results using Annoy or cosine similarity, you can filter out matches whose similarity score falls below a chosen threshold, which reduces false positives. If no result clears the threshold, you return “no results”. You need to choose the threshold based on your specific requirements and dataset.
Machine learning-based classification: You can train a machine learning model to classify the results as relevant or irrelevant based on certain features or similarity scores. By using a labeled dataset of relevant and irrelevant results, you can train a model to predict whether a given result is relevant or not. This approach can help you distinguish between actual matches and false positives.
Hybrid approaches: You can combine lexical search and vector search to improve the accuracy of your results. For example, you can first perform a lexical search to identify potential matches based on textual similarity. Then, you can use vector search or cosine similarity to rank and refine the results further. This hybrid approach can leverage the strengths of both methods and potentially reduce false positives.
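
As a rough illustration of the threshold-based filtering idea, here is a minimal sketch in plain Python. The function names (cosine_similarity, search_with_threshold), the toy vectors, and the 0.75 cutoff are all assumptions made up for this example, not values from the course; in practice the embeddings would come from your model and the cutoff from your own data.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def search_with_threshold(query_vec, doc_vecs, top_k=3, min_score=0.75):
    """Return up to top_k (index, score) pairs whose score is >= min_score.

    min_score is a hypothetical cutoff; an empty list means "no results".
    """
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(i, s) for i, s in scored[:top_k] if s >= min_score]

# Toy 3-dimensional "embeddings", for illustration only.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]

near_hits = search_with_threshold([1.0, 0.05, 0.0], docs)  # close to docs 0 and 1
far_hits = search_with_threshold([0.0, 0.0, 1.0], docs)    # unrelated to every doc
```

Here near_hits keeps documents 0 and 1, while far_hits is empty, which is the “no results” case. If you use Annoy with the "angular" metric, note that it returns distances rather than similarities; its README describes the angular distance as sqrt(2(1 - cos)), so you would apply the cutoff to the distance (or convert it back to a cosine) instead.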

It’s important to note that the choice of the method and threshold setting depends on your specific use case, dataset, and requirements. Experimentation and fine-tuning may be necessary to achieve the desired results.
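
One possible way to do that experimentation is to score a small set of query/result pairs that you have labeled by hand, then pick the cutoff that gives the best F1. The sketch below assumes such a labeled set exists; the function name best_threshold and the scores and labels are invented purely for illustration.

```python
def best_threshold(scores, labels):
    """Return (threshold, f1) that maximizes F1, trying each observed score as a cutoff.

    A result counts as predicted-relevant when its score is >= the threshold.
    """
    best = (None, -1.0)
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        if tp == 0:
            continue  # no true positives at this cutoff, F1 is undefined or zero
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[1]:
            best = (t, f1)
    return best

# Hypothetical similarity scores with hand-assigned relevance labels.
scores = [0.92, 0.88, 0.81, 0.60, 0.55, 0.30]
labels = [True, True, True, False, True, False]

threshold, f1 = best_threshold(scores, labels)
```

F1 is only one choice of criterion. If false positives are more costly than missed results in your application, you may prefer the cutoff that reaches a target precision instead.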
