While I liked that the course provides a structure for evaluating the agent, the following questions remain unanswered for me:
• It's great that we have these metrics for a specific question, but are they meant to be computed for every question ever asked of the agent in production, or only at development time?
• Suppose, based on the metrics, I adjust certain prompts. Those changes might improve the metrics for a specific question but may not be relevant for other questions; how should this be addressed?
• If my agent is used by 300 users, is it recommended to compute these metrics for every question, and can improvements be made that apply across all questions?
• How often can an agent's GPA change? What sort of continuous monitoring is recommended for agent performance?