Hi all,
I have two questions about cross-encoder re-ranking:
The first is about the principle itself. I have not fully understood why it works, because isn’t the purpose of a vector database to find the most relevant texts (by a distance metric like cosine similarity in embedding space)? Is it because we use a different model for re-ranking and thus get a “second opinion” on the relevance? Or is it because cross-encoders can do something fundamentally different from embedding models?
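To make my mental model concrete, here is a minimal sketch of the two approaches as I understand them, using sentence-transformers (the bi-encoder model name is just an example I picked; the cross-encoder is the one I mention below):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "How do cross-encoders differ from embedding models?"
passages = [
    "Cross-encoders score a query and a passage jointly in one forward pass.",
    "Vector databases store precomputed embeddings for fast similarity search.",
]

# Bi-encoder (what the vector database relies on): query and passages are
# embedded independently, then compared by cosine similarity.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example model
query_emb = bi_encoder.encode(query)
passage_embs = bi_encoder.encode(passages)
print("bi-encoder scores:", util.cos_sim(query_emb, passage_embs))

# Cross-encoder: each (query, passage) pair goes through the model together,
# so attention can compare the two texts token by token before scoring.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
ce_scores = cross_encoder.predict([(query, p) for p in passages])
print("cross-encoder scores:", ce_scores)
```

My reading is that the joint forward pass over the pair is the “fundamentally different” part, but I’d appreciate confirmation.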
The second question is about multi-language support. I can successfully query English texts using the embeddings of a German or French query with text-embedding-ada-002. It seems to work because, in an oversimplified view, the language is just one dimension out of the 1536 dimensions of text-embedding-ada-002.
Now what about the cross-encoder I use, “ms-marco-MiniLM-L-6-v2”: can it rank properly if the query and the texts are in different languages? If not, what multilingual cross-encoders are out there?
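For reference, this is how I would test the cross-lingual behaviour (a German query against English passages); I’m unsure whether the monolingual MS MARCO model produces meaningful scores here:

```python
from sentence_transformers import CrossEncoder

# German query, English passages: does the monolingual model still rank sensibly?
query = "Wie funktioniert Re-Ranking mit Cross-Encodern?"
passages = [
    "Cross-encoder re-ranking scores each query-passage pair jointly.",
    "The weather in Paris is usually mild in spring.",
]

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = cross_encoder.predict([(query, p) for p in passages])
for passage, score in sorted(zip(passages, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {passage}")
```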
Regards,
Thomas