Hi there, I have watched all the courses on LLM semantic search.
As I understand it, Annoy and cosine similarity always return the nearest vectors, even when the dataset is small. For example, say I have a database of 1k vectors and the query is not close to any of them. A search with Annoy or cosine similarity will still return results, so I get a lot of false positives.
So my question is: what method can we use to return no results (or only very low scores) in that case? How do we set a threshold on lexical search or on vector search?
Dear @ILoong,
When using methods like Annoy or cosine similarity for semantic search, it is true that you may encounter false positive results if the query is not similar to any of the vectors in your dataset. A nearest-neighbor search always returns the closest vectors it can find, no matter how far away they actually are, so on its own it cannot give you “no results” for an unrelated query.
To handle such cases and set a threshold for lexical search or vector search, you can consider the following approaches:
- Threshold-based filtering: After obtaining the search results using Annoy or cosine similarity, you can apply a cutoff on the similarity score (or a maximum distance) and drop every match that falls below it. If nothing passes the cutoff, you return “no results” instead of the nearest but irrelevant vectors. The cutoff has to be chosen for your specific embedding model, dataset, and requirements.
- Machine learning-based classification: You can train a machine learning model to classify the results as relevant or irrelevant based on certain features or similarity scores. By using a labeled dataset of relevant and irrelevant results, you can train a model to predict whether a given result is relevant or not. This approach can help you distinguish between actual matches and false positives.
- Hybrid approaches: You can combine lexical search and vector search to improve the accuracy of your results. For example, you can first perform a lexical search to identify potential matches based on textual similarity. Then, you can use vector search or cosine similarity to rank and refine the results further. This hybrid approach can leverage the strengths of both methods and potentially reduce false positives.
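As a rough illustration of the threshold-based filtering idea, here is a minimal sketch in plain Python. The function names, the toy 3-dimensional vectors, and the `min_score=0.75` cutoff are all assumptions made up for this example; they are not values from the course, and a real cutoff would need to be tuned for your embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def search_with_threshold(query, corpus, top_k=3, min_score=0.75):
    """Return up to top_k (doc_id, score) pairs whose score is >= min_score.

    min_score is a hypothetical cutoff chosen for this toy example.
    """
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Nearest-neighbor search always yields top_k items; the filter is
    # what allows an empty result when nothing is actually close.
    return [(doc_id, score) for doc_id, score in scored[:top_k] if score >= min_score]


# Toy corpus of made-up 3-dimensional "embeddings".
corpus = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

near_query = [1.0, 0.05, 0.0]  # close to doc_a and doc_b
far_query = [0.0, 0.0, 1.0]    # orthogonal to everything in the corpus

hits = search_with_threshold(near_query, corpus)
misses = search_with_threshold(far_query, corpus)

print(hits)    # doc_a and doc_b pass the cutoff
print(misses)  # empty list: "no results"
```

If you use Annoy with the `angular` metric, the value it returns is a distance rather than a similarity. As far as I know it corresponds to sqrt(2 * (1 - cosine)), so you can either convert it back with cosine = 1 - d^2 / 2 or apply a maximum-distance cutoff directly.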
It’s important to note that the choice of the method and threshold setting depends on your specific use case, dataset, and requirements. Experimentation and fine-tuning may be necessary to achieve the desired results.
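One possible way to run that experimentation is to label a small sample of query/result pairs as relevant or not and pick the cutoff that scores best on them. The sketch below assumes you want to maximize F1, which is only one reasonable choice; the scores and labels are invented for illustration, and `best_threshold` is a hypothetical helper, not part of any library.

```python
def best_threshold(scores, labels):
    """Pick the cutoff that maximizes F1, where score >= cutoff means "relevant".

    scores: similarity scores of retrieved results.
    labels: True if a human judged that result relevant, else False.
    Returns (threshold, f1); threshold is None if no cutoff yields a true positive.
    """
    best = (None, -1.0)
    for t in sorted(set(scores)):  # every observed score is a candidate cutoff
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        if tp == 0:
            continue  # precision and recall are both zero here
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[1]:
            best = (t, f1)
    return best


# Invented labeled sample: similarity score and human relevance judgment.
scores = [0.91, 0.84, 0.78, 0.62, 0.55, 0.40]
labels = [True, True, True, False, True, False]

threshold, f1 = best_threshold(scores, labels)
print(threshold, round(f1, 3))
```

If false positives are more costly than missed results in your use case, you could optimize for precision at a minimum recall instead of F1. The same labeled sample could also serve as training data for the classifier approach mentioned above.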