Measuring human-level performance

In DLS/C3/W1, when talking about human-level performance, it is assumed that we can measure it against some gold standard, i.e. a 100%-accurate set of labels (e.g. for the analysis of x-rays by radiologists).

In real life, where do we get that perfectly-classified dataset against which we can then measure human-level performance (and the performance of our own code)?

Hey @renzodibona,

I guess our best hope is to find some research on the task we are dealing with :slight_smile:
We can't be 100% sure about the correctness of the gold standard, but we assume it is accurate; future research may improve on previous results.
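
For what it's worth, here is a minimal sketch (in Python, with entirely made-up labels) of what "measuring against a gold standard" means in practice: once you treat some label set as ground truth, both human-level error and your model's error are just disagreement rates with it, and their difference is the avoidable bias from C3/W1.

```python
import numpy as np

# Hypothetical labels for 10 x-ray images: 1 = abnormal, 0 = normal.
# The "gold standard" might come from an expert panel or biopsy results.
gold  = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])  # assumed ground truth
human = np.array([1, 0, 1, 0, 0, 0, 1, 0, 1, 1])  # a single radiologist
model = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 1])  # our classifier's output

# Error = fraction of labels that disagree with the gold standard.
human_error = np.mean(human != gold)  # used as a proxy for Bayes error
model_error = np.mean(model != gold)
avoidable_bias = model_error - human_error

print(f"human-level error: {human_error:.2f}")
print(f"model error:       {model_error:.2f}")
print(f"avoidable bias:    {avoidable_bias:.2f}")
```

Of course, if the gold standard itself contains mistakes, both error estimates are off by the same unknown amount, which is exactly why in practice we just assume it is accurate until better labels come along.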