Similarity search fails to capture product numbers

existme · July 16, 2023, 6:07am

Hi everyone,

I’m having trouble with contextual similarity search in my chatbot. I’m using the methods provided in the course, but the search seems to lose accuracy as the number of documents increases.

I’ve tried two different vector databases, Faiss and OpenSearch, each with both cosine similarity and Euclidean distance. I’ve split each of the 400 documents into chunks of 200 characters, which results in about 150,000 chunks in total.
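
For readers following along, the chunking step described above might look roughly like the sketch below. It is only an illustration: the function name `chunk_document` and the dict layout are hypothetical, not the course's code or any library's API, and it assumes plain fixed-size character chunks with no overlap.

```python
def chunk_document(text, source, chunk_size=200):
    """Split text into fixed-size character chunks.

    Each chunk keeps the name of the document it came from, so the
    source can be checked after retrieval. (Hypothetical helper.)
    """
    return [
        {"text": text[start:start + chunk_size], "source": source}
        for start in range(0, len(text), chunk_size)
    ]


# Toy document built by repeating one of the example sentences.
doc = "Product ABCXXX1 supports microSD/microSDHC/microSDXC cards. " * 10
chunks = chunk_document(doc, source="ABCXXX1.html")

for chunk in chunks:
    print(chunk["source"], len(chunk["text"]))
```

With chunks this short, many chunks from different products differ only in the product number, which is the situation the example further down illustrates.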

The main issue is that the similarity search returns relevant data, but it’s usually from the wrong document source. For example, let’s say we have 40 products, and each product’s document contains a chunk like this:

Product ABCXXX1 supports microSD/microSDHC/microSDXC cards.             source: ABCXXX1.html
...
Product ABCXX40 supports microSD/microSDHC cards.                       source: ABCXX40.html

If I do a similarity search with the query “What SD card is supported by ABCXX24”, I usually get the wrong chunks from irrelevant sources. I’ve tried some of the methods provided in the course, such as MMR or various ways of tuning index parameters, but the results aren’t good.

I’ve also run the similarity search manually, by building an array of the relevant chunks and scoring them with the dot product, and those results are correct. So I suspect the problem lies with the vector database rather than the embedding model.
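
A manual check of this kind can be sketched as an exact (brute-force) search. This is a minimal sketch under stated assumptions, not the code used in the post: the function names are hypothetical, the three-dimensional vectors are toy stand-ins for real embeddings, and every vector is assumed to be non-zero. Vectors are normalized first, so the dot product equals cosine similarity.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def brute_force_search(query_vec, chunk_vecs, k=3):
    """Score every chunk against the query and return the top k.

    Returns a list of (chunk_index, score) pairs, best first. No index
    structure is involved, so the ranking is exact.
    """
    q = normalize(query_vec)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(v))), idx)
        for idx, v in enumerate(chunk_vecs)
    ]
    scored.sort(reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy vectors standing in for chunk and query embeddings (assumption).
chunk_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query_vec = [0.9, 0.1, 0.0]

results = brute_force_search(query_vec, chunk_vecs, k=2)
print(results)
```

Comparing this exact ranking with what the vector database returns for the same query and the same embeddings is one way to see whether the index is changing the results.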

Has anyone else run into a similar issue?

Thanks, Reza

P.S. I’m using the sentence-transformers/all-MiniLM-L12-v2 embedding model, and I’ve tried several other embedding models, but this one has been the best so far. However, the overall results are still not good.
