In the lecture on evaluating LLM applications, I understood that the output of the LLM is evaluated through the following process:
- Generate QA data from a part of a document using QAGenerateChain
- Manually prepare the correct QA data
- Combine the data from the first two steps to form the example QA data (the ground-truth data)
- Generate answers to the example questions using RetrievalQA with a vector store (these answers are the predictions)
- Compare each example answer with the corresponding predicted answer using QAEvalChain; if they are equivalent in meaning, the prediction is graded correct (a rough sketch of the whole pipeline follows this list)
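
For reference, here is a minimal sketch of that pipeline using the legacy LangChain API from the course. The documents and the manual example are placeholders I made up, and key names (e.g. flat `query`/`answer` vs. a nested `qa_pairs` key) vary across LangChain versions, so treat this as an illustration rather than a canonical implementation:

```python
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.chains import RetrievalQA
from langchain.evaluation.qa import QAGenerateChain, QAEvalChain
from langchain.schema import Document

llm = ChatOpenAI(temperature=0)

# Placeholder documents; in the course these come from a CSV loader.
docs = [
    Document(page_content="The Cozy Pullover is machine washable and has side pockets."),
    Document(page_content="The Trail Jacket is waterproof and comes in three colors."),
]

# Step 1: generate QA examples from a slice of the documents.
gen_chain = QAGenerateChain.from_llm(llm)
generated = gen_chain.apply_and_parse([{"doc": d.page_content} for d in docs])
# Depending on the version, each item is {"query": ..., "answer": ...}
# or nested under a "qa_pairs" key; normalize to the flat form.
generated = [g.get("qa_pairs", g) for g in generated]

# Step 2: manually written ground-truth examples (hypothetical content).
manual = [{"query": "Is the Cozy Pullover machine washable?", "answer": "Yes"}]

# Step 3: combine the generated and manual data into the example set.
examples = manual + generated

# Step 4: answer the example questions with RetrievalQA over a vector store.
vectorstore = DocArrayInMemorySearch.from_documents(docs, OpenAIEmbeddings())
qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=vectorstore.as_retriever()
)
predictions = qa.apply(examples)  # each prediction carries a "result" key

# Step 5: have an LLM judge compare the example answer with the prediction.
eval_chain = QAEvalChain.from_llm(llm)
graded = eval_chain.evaluate(
    examples, predictions,
    question_key="query", answer_key="answer", prediction_key="result",
)
for ex, grade in zip(examples, graded):
    # The grade key is "results" in recent versions ("text" in older ones).
    print(ex["query"], "->", grade.get("results", grade.get("text")))
```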
My question is: is it appropriate to treat the output of QAGenerateChain in the first step as valid example data? That would hold if we could trust QAGenerateChain's results completely, but I suspect there are cases where the generated QA pairs are themselves inaccurate.
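
To make the concern concrete, the kind of human vetting I am wondering about might look something like the following (entirely hypothetical; `generated` is the parsed QAGenerateChain output from the sketch above):

```python
# Hypothetical spot-check: print the generated pairs for human review
# before treating them as ground truth.
for i, pair in enumerate(generated):
    print(f"[{i}] Q: {pair['query']}\n    A: {pair['answer']}")

# Keep only the pairs a reviewer approved (indices chosen by hand).
approved_indices = [0]  # placeholder
vetted = [generated[i] for i in approved_indices]
```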
Would anyone be able to provide some advice on this question?