I have difficulty understanding how we would measure human-level performance if the labels are (as would be a common approach) added by humans. Isn’t it then always 0% error?
Take the cat example: our baseline is humans labelling the cat pictures. So human-level performance on that dataset is 100% accuracy (0% error), because it defines the baseline.
Otherwise how could we identify the ones where humans were incorrect?
Interesting question! Well, as Prof Ng mentions in the lectures on this topic, there is no one single “Human Performance”. He describes at least three levels:
A general population of non-expert users.
A single human expert (e.g. a radiologist looking at medical images).
A committee of human experts (a team of expert radiologists).
This is in order of increasing performance (decreasing error), of course. And all of those error rates are >= the Bayes Error, by definition.
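Written out as an inequality chain (using my own shorthand for the three error rates, not notation from the lectures):

$$
\text{Bayes error} \;\le\; E_{\text{expert committee}} \;\le\; E_{\text{single expert}} \;\le\; E_{\text{general population}}
$$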
But the interesting point you raise, which I had never previously considered in this regard, is that the labels themselves are essentially human generated by definition. So why don’t the labels define human performance, and how would we ever know that some of the labels are actually incorrect?

I think it comes down to how this type of labelling task is actually done. For a simple case requiring no special training or expertise, like recognizing cats in an RGB image, the large-scale labelling would usually be done by some kind of “mechanical turk” process involving a lot of people hired over the Internet, and that process has some intrinsic error rate caused by carelessness, given the sheer volume of data. But a person who is not in a hurry and wants to assess the error rate of such a dataset could perform a manual evaluation on a sample and estimate what percentage of the images are mislabelled. They could then fix the incorrect cases they found, but that probably doesn’t scale if the cardinality of the dataset is O(10^5) or greater. So maybe all you can say is that Labelling Error >= Human Error >= Bayes Error.
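Just to make that “audit a sample” idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `review_fn` stands in for a careful human re-check of one example’s label, and `example_ids` is whatever indexing your dataset uses.

```python
import math
import random

def estimate_label_error(example_ids, review_fn, audit_size=500, seed=0):
    """Estimate a dataset's mislabel rate by manually auditing a random sample.

    review_fn(example_id) should return True when a careful reviewer decides
    the stored label is wrong. Auditing a few hundred examples is feasible
    even when the full dataset has 10^5 or more examples.
    """
    rng = random.Random(seed)
    sample = rng.sample(list(example_ids), min(audit_size, len(example_ids)))

    wrong = sum(1 for example_id in sample if review_fn(example_id))
    p_hat = wrong / len(sample)

    # Rough 95% normal-approximation confidence interval on the mislabel rate.
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / len(sample))
    return p_hat, (max(0.0, p_hat - margin), min(1.0, p_hat + margin))
```

That gives you an estimate of the Labelling Error term in the inequality above without having to re-check the whole dataset.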
Here is how I thought about the specific example of cancer patient diagnosis …
We know some patients have cancer, verified by their ill health and lived experience, and we have their radiology images. This label is provided by a “human” but is also supported by the patient’s experience, i.e. their illness. The data is labeled correctly by a process that includes human observation, patient experience, several consults, second opinions, etc., possibly over the course of many years…
Now, we take this data and hand it off to the “expert radiologist committee” and ask them to make a prediction. They may miss (actually, probably will miss) some of the cases. I could be wrong, but that is what counts as “human error”.
I guess the difference is that the initial human “label” isn’t determined based on a few seconds of examination, but the predicted human “label” is.
Thanks for your answers!
Especially the insights on how labelling is realistically done really help make it clearer. And I think the cancer case is a good example where we can re-evaluate the labels given by human experts when we look back in time (which I didn’t even think about with the cat example).
Here’s an interesting read that says, “it can cost up to two times more to use a crowd, because the company distributes the same task to multiple people and often requires a consensus model with multiple people completing or reviewing tasks to achieve passable quality.”
And in the case of labeling medical images, it requires professional expertise.
I would think human-level performance has to be measured against ground-truth.
For example, if you want to know how well doctors can detect cancer from radiology images, then you’d start by collecting images from known cancer patients and from non-cancer/healthy people, and then show those radiology images to the experts to see how well they are able to identify the cancer cases.
Similarly, for the cat/no-cat image example, I would think the proper way would be to first take images of different objects directly, with varying resolution, zoom, clarity, lighting and so on (so you know the ground truth), and then have people try to classify those images to establish baseline human performance.
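As a tiny illustration of measuring against ground truth (the lists below are made-up placeholders, not real data):

```python
def human_error_rate(ground_truth, human_labels):
    """Fraction of examples where the human labeller disagrees with ground truth."""
    assert len(ground_truth) == len(human_labels)
    mistakes = sum(gt != h for gt, h in zip(ground_truth, human_labels))
    return mistakes / len(ground_truth)

# Ground truth is known from how the images were collected;
# human_labels come from people classifying those same images afterwards.
ground_truth = ["cat", "cat", "no-cat", "cat", "no-cat"]
human_labels = ["cat", "no-cat", "no-cat", "cat", "no-cat"]
print(human_error_rate(ground_truth, human_labels))  # 0.2
```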