Issues with Precision in Searches for Specific Identifiers in Vector Databases

Hello everyone,

I’m currently facing a specific issue in an application that uses a vector database for semantic searches. I have a list of documents with numbered identifiers in the format “2024XXX/0000000000”. When attempting to search for a document by its specific identifier, the system often fails to retrieve the correct document. Additionally, similar service names (e.g., Service A and Service B) are being confused during searches.

Has anyone else encountered a similar problem? If so, what solutions did you implement to address these challenges?

I’m considering using exact filters or some form of post-processing, but I’m open to other suggestions and best practices that could help improve the accuracy of my application.

Thanks in advance for your help!

Hello @Ingo_Djan,

If I understand correctly, the first paragraph describes a collection of documents, each with a unique identifier in the format “2024XXX/0000000000”, and you are trying to retrieve a document by that identifier using semantic search?

If that’s the case, it is expected that you would run into the issues you describe. Semantic search is typically designed to retrieve documents based on meaning or content similarity rather than exact matches to specific identifiers. If your goal is to retrieve a document using an identifier like “2024XXX/0000000000,” a traditional keyword search or exact match query would indeed be more suitable.
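To make the exact-match idea concrete, here is a minimal sketch. I don't know the real shape of your identifiers, so the regular expression (four digits, three alphanumeric characters, a slash, ten digits) is an assumption, and `find_by_identifier` and `docs_by_id` are hypothetical names used only for illustration.

```python
import re

# Assumed identifier shape: 4-digit year, 3 alphanumerics, "/", 10 digits.
# Adjust the pattern to whatever "2024XXX/0000000000" really stands for.
ID_PATTERN = re.compile(r"\b\d{4}[A-Za-z0-9]{3}/\d{10}\b")


def find_by_identifier(query, docs_by_id):
    """Return the documents whose identifier appears verbatim in the query."""
    ids = ID_PATTERN.findall(query)
    # Plain dictionary lookup: no embeddings involved, so no near-miss matches.
    return [docs_by_id[i] for i in ids if i in docs_by_id]


# Hypothetical sample data.
docs_by_id = {
    "2024ABC/0000000001": "Contract for Service A",
    "2024ABC/0000000002": "Contract for Service B",
}

hits = find_by_identifier("What does 2024ABC/0000000002 say?", docs_by_id)
print(hits)
```

If the query contains no identifier, the function returns an empty list, and you could fall back to semantic search at that point.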

Confusing similar concepts (such as the services you mention) is also a common issue in semantic search. This happens because underlying models focus on the meaning and context of the terms rather than their exact wording.

Generally speaking, if you know the exact terms for which you are trying to retrieve information, it is better to perform an exact-match query. It is simpler, computationally less expensive, and gives more reliable results.

I don’t know your use case or the framework you are using, but if semantic search is necessary, combining it with keyword-based search may be a more appropriate solution. For instance, first filter documents by the exact service name, and then apply semantic search within that subset to retrieve whatever else you are looking for.
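A rough sketch of that filter-then-rank idea, in plain Python so it runs without any vector database. The `similarity` function here is only a toy word-overlap score standing in for real embedding similarity, and `filtered_search` and the sample documents are made up for illustration.

```python
def tokens(text):
    """Lowercased whitespace tokens of a string."""
    return set(text.lower().split())


def similarity(a, b):
    """Toy Jaccard word overlap; a stand-in for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def filtered_search(query, service, docs, top_k=3):
    """Exact filter on the service name first, then rank what is left."""
    # Step 1: exact metadata match, so "Service B" can never leak in.
    candidates = [d for d in docs if d["service"] == service]
    # Step 2: similarity ranking inside the filtered subset only.
    ranked = sorted(
        candidates,
        key=lambda d: similarity(query, d["text"]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical sample data.
docs = [
    {"service": "Service A", "text": "billing rules for invoices"},
    {"service": "Service A", "text": "onboarding checklist for new staff"},
    {"service": "Service B", "text": "billing rules for invoices and refunds"},
]

results = filtered_search("invoice billing rules", "Service A", docs, top_k=1)
print(results)
```

In a vector database, step 1 is usually a metadata filter passed alongside the query (in ChromaDB, the `where` argument of `collection.query`), so the store does the filtering before it ranks by embedding distance.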

Best!

Thank you for your response! You got it right, that’s exactly what I’m aiming for. I have a chatbot doing RAG on internal documents locally.

I understand a simple term-based search could solve the issue. Currently, I’m parsing the user’s query to extract numbers and names to search in the metadata, filter, etc. (ChromaDB). However, I’m hitting performance bottlenecks with complex questions where documents are merely referenced, written differently (e.g., 12 of 24, 12/24…), or something else.

I figured maybe someone else might have encountered this too. Or maybe I’m trying to kill an ant with a gun.

Thanks again, Anna!

You're welcome!

Regarding the same values that are written differently, perhaps you could simplify the process by creating a mapping of all the MM/YY variations that are equivalent, especially if there are only 3-4 types of variations.

I’m not sure if you’re already using them, but regular expressions could be helpful in retrieving the different variations with the same meaning.
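As an illustration of the mapping and the regular expression together, here is a small sketch that rewrites the variants you mention (“12 of 24”, “12/24”) plus a hyphenated form into one canonical “12/24” form. The set of separators and the name `normalize_pairs` are assumptions on my side. The pattern is deliberately loose, so it would also rewrite things like the tail of an ISO date, and you would need to tighten it for your data.

```python
import re

# Two numbers joined by "/", "-" or the word "of", with optional spaces.
# Assumed variants only; extend the alternation if you see other spellings.
PAIR_PATTERN = re.compile(
    r"\b(\d{1,2})\s*(?:/|-|of)\s*(\d{2})\b",
    re.IGNORECASE,
)


def normalize_pairs(text):
    """Rewrite every matched variant to a zero-padded 'NN/NN' form."""
    return PAIR_PATTERN.sub(
        lambda m: f"{int(m.group(1)):02d}/{m.group(2)}",
        text,
    )


normalized = normalize_pairs("see 12 of 24 and 3-24")
print(normalized)
```

Applying the same normalization to both the stored metadata and the incoming user query means a single exact filter can match all spellings.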

Best