If I fine-tune a model or use RAG, I need to evaluate it on prompts that are specific to the fine-tuning data or to the information retrieved through RAG. A public dataset would not have this specificity, so I would need to put together my own evaluation dataset.
Thanks for the response @TMosh! If I’m implementing RAG for a client, would you recommend asking the client for a set of sample questions with their expected answers? Or should I let users try the solution and add a scoring system to track the questions that were not answered correctly?
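The second option mentioned above (letting users try the solution and tracking misses) can be sketched very simply. This is a minimal, hypothetical example: `FeedbackTracker` and its methods are names made up for illustration, and in practice the "correct" flag would come from a thumbs-up/down widget or a reviewer, not be hard-coded.

```python
from collections import Counter


class FeedbackTracker:
    """Log user ratings per question and surface the worst performers."""

    def __init__(self):
        # Each entry is (question, answer, was_correct)
        self.ratings = []

    def record(self, question, answer, correct):
        self.ratings.append((question, answer, correct))

    def failure_rate(self):
        """Fraction of rated answers marked incorrect."""
        if not self.ratings:
            return 0.0
        failures = sum(1 for _, _, ok in self.ratings if not ok)
        return failures / len(self.ratings)

    def failed_questions(self):
        """Questions marked incorrect, most frequent first."""
        counts = Counter(q for q, _, ok in self.ratings if not ok)
        return counts.most_common()


# Example usage with made-up questions:
tracker = FeedbackTracker()
tracker.record("What is our refund window?", "30 days", correct=True)
tracker.record("Who approves invoices?", "The CEO", correct=False)
print(f"failure rate: {tracker.failure_rate():.0%}")
print(tracker.failed_questions())
```

The questions that accumulate failures are exactly the ones worth adding (with correct answers) to a curated evaluation set, so the two approaches complement each other.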
The choice of evaluation dataset depends on what kind of RAG application you are building. This in turn depends on your primary stakeholders' demands: their do's and don'ts, the required features, and the investment available.
A larger dataset is not always best in terms of time and money, though it can give a better outcome. Balancing these criteria helps you decide how small a dataset can be while still covering all the prompt features, so that evaluation doesn't take too much time for your RAG application.
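A small, client-specific eval set like the one discussed above can be run with very little code. A minimal sketch, assuming a hypothetical `rag_answer_fn` (your RAG pipeline's question-to-answer function) and a keyword-match scoring rule, which is a crude but cheap stand-in for LLM-as-judge or human grading:

```python
# Hypothetical eval set: each item has a question and keywords the
# correct answer must contain (supplied by the client).
eval_set = [
    {"question": "What is the return policy?",
     "keywords": ["30 days", "receipt"]},
    {"question": "Who approves invoices?",
     "keywords": ["finance team"]},
]


def keyword_score(answer, keywords):
    """Fraction of expected keywords found in the answer (case-insensitive)."""
    answer = answer.lower()
    hits = sum(1 for kw in keywords if kw.lower() in answer)
    return hits / len(keywords)


def evaluate(rag_answer_fn, eval_set, threshold=0.5):
    """Run every eval question through the pipeline and score the answers."""
    results = []
    for item in eval_set:
        answer = rag_answer_fn(item["question"])
        score = keyword_score(answer, item["keywords"])
        results.append({"question": item["question"],
                        "score": score,
                        "passed": score >= threshold})
    return results


# Example with a stub pipeline (replace with your real RAG call):
stub = lambda q: "Items can be returned within 30 days with a receipt."
for r in evaluate(stub, eval_set):
    print(r)
```

Starting with a dozen or so questions like this is usually enough to catch regressions, and you can grow the set as user feedback reveals new failure cases.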