Very poor performance in a question-answering neural embedding system

I've heard some reports in the wild of very poor (irrelevant) results being returned by neural QA systems built on LLMs with retrieval architectures, as described in this short course.

Let’s brainstorm reasons for bad or random results. Here are mine – can you think of more?

Some people report this. For example, a Reddit commenter on /r/LanguageTechnology was asking why his QA app, built on embeddings and a vector database, was performing so poorly.

Did he remember to rerank after running the query? Reranking is a second stage where the LLM re-scores the retrieved candidates directly, using its language understanding, and keeps the best ones.
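As a toy sketch of that two-stage retrieve-then-rerank pattern (assuming a hypothetical `llm_relevance_score` in place of a real LLM or cross-encoder call; all names here are mine, not from any particular library):

```python
def llm_relevance_score(query, passage):
    # Hypothetical scorer: a real system would ask an LLM or a
    # cross-encoder to judge relevance. Here: crude token overlap.
    q_tokens = set(query.lower().split())
    p_tokens = set(passage.lower().split())
    return len(q_tokens & p_tokens) / max(len(q_tokens), 1)

def rerank(query, candidates, top_n=3):
    # Re-score the candidates the vector search returned, keep top_n.
    scored = [(llm_relevance_score(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]

candidates = [
    "The capital of France is Paris.",
    "Bananas are rich in potassium.",
    "Paris hosted the 1900 Summer Olympics.",
]
print(rerank("What is the capital of France?", candidates, top_n=1))
```

The key point is the shape of the pipeline: a cheap vector search produces a coarse top-k, and the expensive LLM only scores those few candidates.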

Did he remember to create indexes at all? You can't just insert data into the vector database; depending on the system, you may also need to create or build indexes after inserting everything.
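Different vector databases expose this differently (some build indexes automatically, others require an explicit build step after bulk insert), so the following pure-Python toy only illustrates the insert-then-build flow; the class and method names are hypothetical:

```python
import math

class ToyVectorIndex:
    """Toy flat index: insert vectors, then build() before searching."""

    def __init__(self):
        self._rows = []      # (doc_id, vector) pairs
        self._built = False

    def insert(self, doc_id, vector):
        self._rows.append((doc_id, vector))
        self._built = False  # new inserts invalidate the index

    def build(self):
        # Real DBs build ANN structures (HNSW, IVF, ...); here we just
        # precompute vector norms so search can do cosine similarity.
        self._norms = {d: math.sqrt(sum(x * x for x in v)) or 1.0
                       for d, v in self._rows}
        self._built = True

    def search(self, query, k=1):
        if not self._built:
            raise RuntimeError("call build() after inserting documents")
        qn = math.sqrt(sum(x * x for x in query)) or 1.0
        scored = [(sum(a * b for a, b in zip(query, v)) / (qn * self._norms[d]), d)
                  for d, v in self._rows]
        scored.sort(reverse=True)
        return [d for _, d in scored[:k]]

index = ToyVectorIndex()
index.insert("doc-a", [1.0, 0.0])
index.insert("doc-b", [0.0, 1.0])
index.build()                         # forgetting this step is the bug
print(index.search([0.9, 0.1], k=1))  # nearest to doc-a
```

A missing build step often fails loudly, but in some systems an unindexed collection silently falls back to slow or partial scans, which can look like "random" results.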

Did he remember to fine-tune the LLM (or use few-shot prompt engineering) on application-specific positive and negative examples, to improve rerank performance for his particular application?
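A minimal sketch of the few-shot route, assuming a text-prompted LLM; the example pairs and `build_rerank_prompt` helper are hypothetical, not from any particular framework:

```python
# Build a few-shot rerank prompt from application-specific positive and
# negative examples, so the LLM learns what "relevant" means here.

FEW_SHOT_EXAMPLES = [
    # (query, passage, label) -- made-up domain examples
    ("reset my router", "Hold the reset button for 10 seconds.", "RELEVANT"),
    ("reset my router", "Our routers come in three colors.", "NOT RELEVANT"),
]

def build_rerank_prompt(query, passage):
    lines = ["Decide if the passage answers the query.",
             "Reply with RELEVANT or NOT RELEVANT.", ""]
    for q, p, label in FEW_SHOT_EXAMPLES:
        lines += [f"Query: {q}", f"Passage: {p}", f"Answer: {label}", ""]
    lines += [f"Query: {query}", f"Passage: {passage}", "Answer:"]
    return "\n".join(lines)

prompt = build_rerank_prompt("reset my router",
                             "Unplug the router, wait, plug it back in.")
print(prompt)
```

Including a negative example matters: without one, models tend to label everything relevant.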

Did he choose a pretrained model that already has some built-in knowledge of his application domain? He should run some experiments to see whether the LLM knows anything about his domain at all.
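One cheap way to run that experiment is a small probe set: ask the model factual questions from the domain and check whether the answers mention the expected terms. `call_llm` below is a stub standing in for a real model API; the probes and canned answers are invented for illustration:

```python
# Probe whether a model already knows domain facts.

PROBES = [
    # (question, term a correct answer should mention) -- domain-specific
    ("What does HTTP status 404 mean?", "not found"),
    ("What port does HTTPS use by default?", "443"),
]

def call_llm(question):
    # Stub: a real implementation would query the actual model.
    canned = {
        "What does HTTP status 404 mean?": "It means the page was not found.",
        "What port does HTTPS use by default?": "HTTPS defaults to port 443.",
    }
    return canned.get(question, "")

def domain_knowledge_score(probes):
    # Fraction of probes whose answer contains the expected term.
    hits = sum(1 for q, term in probes if term in call_llm(q).lower())
    return hits / len(probes)

print(domain_knowledge_score(PROBES))
```

A low score suggests the model needs fine-tuning, retrieval over domain documents, or a different base model.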

Did he consistently use the correct language for his application? Some models do not default to your application's particular human language.
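A crude sanity check for this is to flag documents or queries whose language doesn't match what the application expects. The stopword heuristic below is purely illustrative (and only covers two languages I made up lists for); a real system should use a proper language-identification library:

```python
# Naive language guess by stopword overlap -- illustration only.

STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "on"},
    "de": {"der", "und", "ist", "von", "zu", "dem"},
}

def guess_language(text):
    tokens = set(text.lower().split())
    # Pick the language whose stopword set overlaps the text the most.
    return max(STOPWORDS, key=lambda lang: len(tokens & STOPWORDS[lang]))

print(guess_language("the cat is on the mat"))
print(guess_language("der Hund ist von dem Haus"))
```

If queries come in one language and the embedded corpus is in another, a monolingual embedding model can return near-random neighbors, which matches the "irrelevant results" symptom.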

That’s all I’ve got.

What do you know about it? Thanks for reading.