In general, how do you come up with a human-level benchmark?

When Andrew discussed the process of diagnostics for a speech recognition problem, while explaining how to tell whether a model has high bias or high variance, he used a benchmark of roughly 10.8% human-level error (I'm not sure about the exact number), and said you should compare your model's error against it. I'm curious: in general, how do people come up with these numbers? In particular, how can I come up with a baseline model and a proper benchmark in my own projects? Is there any repository where such numbers are recorded?

Domain experts can do this. For example, in any medical task, expert doctors can set a threshold for human accuracy…

Hello @couzhei ,
Welcome to the Discourse community, and thanks a lot for asking your question here. I will do my best to answer your questions.

Publications that release a new dataset tend to report a "human-level error" alongside the performance of existing state-of-the-art models on that dataset. Individual practitioners do not usually come up with these numbers themselves.

Here is how human-level error might be calculated in the field of optical coherence tomography angiography (OCTA): 100 OCTA images, 50 of which show evidence of diabetic retinopathy, are given to a team of expert ophthalmologists. Each ophthalmologist assesses all 100 images for diabetic retinopathy, and the average of the errors they make can be referred to as the human-level error. The authors then run state-of-the-art methods on the same task for comparison.
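That averaging step can be sketched in a few lines of Python. All of the numbers, labels, and expert names below are invented purely for illustration:

```python
# Hypothetical sketch: estimating human-level error from a panel of experts.
# Ground-truth labels for 10 images (1 = diabetic retinopathy present).
ground_truth = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]

# Each expert's independent predictions on the same 10 images (made up).
expert_predictions = {
    "expert_a": [1, 0, 1, 0, 0, 0, 1, 0, 1, 0],
    "expert_b": [1, 0, 1, 1, 0, 1, 1, 0, 1, 0],
    "expert_c": [1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
}

def error_rate(preds, truth):
    """Fraction of images the annotator got wrong."""
    return sum(p != t for p, t in zip(preds, truth)) / len(truth)

per_expert = {name: error_rate(preds, ground_truth)
              for name, preds in expert_predictions.items()}

# Human-level error: the average error rate across the expert panel.
human_level_error = sum(per_expert.values()) / len(per_expert)
print(f"Human-level error: {human_level_error:.3f}")
```

On real data you would of course use the full image set and the actual panel, but the calculation itself is just this average of individual error rates.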

Human-level error is a niche term, and I have only seen it in a few publications that release datasets.

There are various metrics for different tasks.
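As a toy illustration of why the choice of metric matters (all numbers invented): on an imbalanced screening task, plain accuracy can look good even when the model misses every positive case, which is why recall or similar metrics are often reported alongside it.

```python
# Hypothetical imbalanced task: 1 positive case out of 10 images.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10  # a useless model that always predicts "healthy"

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0

# High accuracy (0.90) but zero recall: the one sick patient is missed.
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```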

I would recommend you to view the following repository: GitHub - EpistasisLab/pmlb: PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms.

I hope I was able to answer your questions. Please feel free to post a follow-up question.