Hi everyone,
I’m having trouble with contextual similarity search in my chatbot. I’m using the methods from the course, but the search loses accuracy as the number of documents grows.
I’ve tried two vector stores, Faiss and OpenSearch, each with both cosine similarity and Euclidean distance. I split each of the 400 documents into 200-character chunks, which gives about 150,000 chunks in total.
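To make the chunking step concrete, here is a minimal sketch of fixed-size chunking, assuming plain character offsets with no overlap. The function name `chunk_document` and the sample text are hypothetical and only illustrate the idea; this is not the exact code from my pipeline.

```python
def chunk_document(text, source, size=200):
    """Split text into consecutive chunks of at most `size` characters.

    Each chunk keeps the name of the document it came from, so the
    source can be checked after a search.
    """
    return [
        {"text": text[start:start + size], "source": source}
        for start in range(0, len(text), size)
    ]


# Hypothetical sample document built from the example sentence in this post.
sample = "Product ABCXXX1 supports microSD/microSDHC/microSDXC cards. " * 10
chunks = chunk_document(sample, "ABCXXX1.html")

for chunk in chunks:
    print(len(chunk["text"]), chunk["source"])
```

With chunks this small, a chunk can easily lose the product name that appears elsewhere in the document, which is why each chunk carries its source here.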
The main issue is that the similarity search returns relevant content, but usually from the wrong source document. For example, say we have 40 different products, and each one has a chunk like this:
Product ABCXXX1 supports microSD/microSDHC/microSDXC cards. source: ABCXXX1.html
...
Product ABCXX40 supports microSD/microSDHC cards. source: ABCXX40.html
If I run a similarity search with the query “What SD card is supported by ABCXX24”, I usually get chunks about other products rather than the one from ABCXX24.html. I’ve tried some of the methods from the course, such as MMR and tuning various index parameters, but the results are still poor.
I’ve also run the similarity search manually: I built an array of the relevant chunk embeddings and scored them against the query embedding with the dot product, and the ranking came out correct. So I have a feeling that this is a problem with the vector database and not the embedding model.
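The manual check described above can be sketched roughly as follows. This is a simplified illustration, not my exact code: the 3-dimensional vectors are made-up stand-ins for real embeddings (which would come from the embedding model), and `brute_force_search` is a hypothetical helper name.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def brute_force_search(query_vec, chunk_vecs, sources, k=3):
    """Score every chunk against the query exactly (no index) and return the top k."""
    q = normalize(query_vec)
    scored = [
        (dot(q, normalize(vec)), src)
        for vec, src in zip(chunk_vecs, sources)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors for illustration only; real ones would be model embeddings.
chunk_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
sources = ["ABCXXX1.html", "ABCXX24.html", "ABCXX40.html"]
query_vec = [0.2, 0.8, 0.1]

results = brute_force_search(query_vec, chunk_vecs, sources, k=2)
for score, source in results:
    print(f"{score:.3f}  {source}")
```

Because this scores every chunk exactly, comparing its top results with what the index returns for the same vectors should show whether the index itself is changing the ranking.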
Has anyone else run into a similar issue?
Thanks, Reza
P.S. I’m using the sentence-transformers/all-MiniLM-L12-v2 embedding model, and I’ve tried several other embedding models, but this one has been the best so far. However, the overall results are still not good.