Evaluation Part II

As part of evaluation part II, Andrew talks about 2 design patterns. The rubric pattern and ideal answer pattern. The rubric pattern is understandable where we ask the model to evaluate completion based on fixed set of questions.
But in ideal answer pattern , we evaluate the completion based on ideal answer. The example used in ideal answer mentions about specific products, categories and details about them. The user prompt used also asks about those specific products. Does this mean, in order to follow the ideal answer pattern , we will have create ideal answers for each of our development set or test set user queries. And suppose we are able to tune the prompt to based on evaluation to work on test set, but then how do we evaluate the response in production scenario where we may not have ideal answers for every customer service queries that the model has to respond to.