Hello everyone,
In “Evaluation Part II” of the “Building Systems with the ChatGPT API” short course,
we used the factual-content scoring rubric from the OpenAI evals project.
Here is the description for each score:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.
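For context, here is a minimal sketch (my own illustration, not the official evals code) of how a rubric like this is typically applied: the grader model is given the question, the expert answer, and the submitted answer, and is asked to reply with exactly one of the letters A–E.

```python
import re

# The five rubric options quoted above. This is only an illustration of the
# model-graded pattern; the real evals project builds its prompts differently.
RUBRIC = {
    "A": "The submitted answer is a subset of the expert answer and is fully consistent with it.",
    "B": "The submitted answer is a superset of the expert answer and is fully consistent with it.",
    "C": "The submitted answer contains all the same details as the expert answer.",
    "D": "There is a disagreement between the submitted answer and the expert answer.",
    "E": "The answers differ, but these differences don't matter from the perspective of factuality.",
}

def build_grading_prompt(question: str, expert: str, submission: str) -> str:
    """Assemble a grading prompt asking the grader model to pick one letter."""
    options = "\n".join(f"({k}) {v}" for k, v in RUBRIC.items())
    return (
        f"Question: {question}\n"
        f"Expert answer: {expert}\n"
        f"Submitted answer: {submission}\n\n"
        "Compare the factual content of the submitted answer with the expert "
        "answer. Choose one option:\n"
        f"{options}\n"
        "Answer with a single letter."
    )

def parse_choice(model_output: str):
    """Extract the rubric letter from the grader's reply, or None."""
    # Accept either a parenthesized letter like "(B)" or a bare letter reply.
    m = re.search(r"\(([A-E])\)", model_output)
    if m:
        return m.group(1)
    s = model_output.strip().upper()
    return s if s in RUBRIC else None
```

Note that the grader's raw reply (the letter) is categorical, not ordinal — which is exactly why the ordering question below comes up.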
It sounds to me like C is the best score and D is the worst you can get. Is that correct?
Do you know why the checks were ordered that way, which suggests that A is the best score and E is the worst?
Thank you in advance for your answer!