In DLS/C3/W1, when talking about human-level performance, it is assumed that we can measure it against some gold-standard, 100%-accurate reference (e.g., for the analysis of X-rays by radiologists).
In real life, where do we get that perfectly classified dataset against which we can then measure human-level performance (and the performance of our own code)?
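
For concreteness, here is a minimal sketch (in Python, with made-up arrays) of what I mean. Both error rates below are only computable relative to `y_true`, and that reference array is exactly what I don't know how to obtain in practice:

```python
import numpy as np

# Hypothetical example: y_true is the assumed gold-standard labeling.
# The question is where such labels would come from in the real world.
y_true  = np.array([1, 0, 1, 1, 0, 1, 0, 0])  # assumed 100%-accurate labels
y_human = np.array([1, 0, 1, 0, 0, 1, 0, 1])  # e.g., a radiologist's reads
y_model = np.array([1, 0, 1, 1, 0, 0, 0, 0])  # our classifier's predictions

# Error rates are defined relative to the gold standard.
human_error = np.mean(y_human != y_true)
model_error = np.mean(y_model != y_true)
print(f"human error: {human_error:.2%}, model error: {model_error:.2%}")
```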
