Lecture on Evaluation: Concerning the Requirements of Example Data

In the lecture on how to evaluate an LLM application, I understood that the LLM's output is evaluated through the following process (roughly sketched in code after the list):

  1. Generate QA data from a part of a document using QAGenerateChain
  2. Manually prepare the correct QA data
  3. Combine 1 and 2 to prepare the example QA data (ground truth data)
  4. Generate answers to the questions in the example data using RetrievalQA + a vector store (referred to as the predictions)
  5. Compare each example answer with the predicted answer using QAEvalChain; if they are identical in meaning, the prediction is considered correct

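For reference, here is a minimal sketch of that pipeline as I understand it. It assumes the classic `langchain` API shown in the course; import paths, the exact shape of the generated examples (e.g. whether they are wrapped in a `qa_pairs` key), and the key holding the grading verdict can differ between library versions, and `docs` / `vectorstore` are placeholders for your own loaded documents and index.

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.evaluation.qa import QAGenerateChain, QAEvalChain

llm = ChatOpenAI(temperature=0)

# The application under test: a RetrievalQA chain over an existing vector store
# ("vectorstore" and "docs" are placeholders for your own index and documents).
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
)

# Step 1: LLM-generated QA pairs from a subset of the documents.
example_gen_chain = QAGenerateChain.from_llm(llm)
generated = example_gen_chain.apply_and_parse([{"doc": d} for d in docs[:5]])
# Depending on the langchain version, each item is {"query": ..., "answer": ...}
# or wrapped as {"qa_pairs": {...}} -- unwrap if needed.
generated = [g.get("qa_pairs", g) for g in generated]

# Step 2: hand-written QA pairs (ground truth we fully trust).
manual_examples = [
    {"query": "Does the product have side pockets?", "answer": "Yes"},
]

# Step 3: combine both into the example (ground truth) set.
examples = manual_examples + generated

# Step 4: run the application over the example questions to get predictions.
predictions = qa.apply(examples)

# Step 5: grade each prediction against the example answer with an LLM judge.
eval_chain = QAEvalChain.from_llm(llm)
graded = eval_chain.evaluate(examples, predictions)

for example, grade in zip(examples, graded):
    # The verdict key is "results" or "text" depending on the version.
    verdict = grade.get("results", grade.get("text"))
    print(example["query"], "->", verdict)
```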
My question is: is it appropriate to treat the output of QAGenerateChain in step 1 as valid example data? This would be fine if we could trust QAGenerateChain's results 100%, but I suspect there are cases where we cannot.

Would anyone be able to provide some advice on this question?