Measuring & improving accuracy of an LLM-based conversational app

One of the courses described this process with an example of generating SQL statements from natural language input. In that example, one of the best practices was to provide a few good examples (few-shot prompting).
I am wondering what the approach would be for a conversational app with multiple rounds of back and forth. Let's say it is an advisory app that helps you navigate certain topics. Generating good examples seems trickier here, as I can only judge the success of a conversation at the end, seeing where we got to, and can't judge it from a single question/answer pair.
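To make the question concrete, here is a minimal sketch of conversation-level (rather than turn-level) evaluation, assuming you keep a set of recorded transcripts. The keyword-matching judge below is a hypothetical stand-in for what would realistically be an LLM-as-judge call scoring the final outcome against a rubric; all names here are illustrative, not from any specific framework.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str   # "user" or "assistant"
    text: str

@dataclass
class Conversation:
    turns: list = field(default_factory=list)

def judge_outcome(conversation, success_markers):
    """Score a whole conversation by its final state, not turn by turn.

    Stand-in judge: checks what fraction of the success markers appear in
    the last assistant turn. In practice, replace this with an LLM-as-judge
    call that grades the ending against a rubric.
    """
    final = next(
        (t for t in reversed(conversation.turns) if t.role == "assistant"),
        None,
    )
    if final is None:
        return 0.0
    hits = sum(1 for m in success_markers if m.lower() in final.text.lower())
    return hits / len(success_markers)

def evaluate_suite(conversations, success_markers):
    """Average the end-of-conversation scores over a test set of transcripts."""
    scores = [judge_outcome(c, success_markers) for c in conversations]
    return sum(scores) / len(scores)
```

The point of the sketch is that the unit of evaluation is the whole transcript: you collect conversations, define what a successful ending looks like, and aggregate, which sidesteps the problem of grading any single question/answer pair in isolation.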

So, does anyone have good hints or pointers on this one?