Let's say I have a model f that I want to prove has surpassed human-level performance.
I wonder how to actually prove that. I was thinking we need labelled data to do so, but labelled data is normally produced by humans (experts). So where would those labels come from?
So your question is: how can a model possibly surpass human performance when the correctness of the labels it is trained on is limited by humans?
This is not a problem, because human-level performance is a quantity describing a single human, be it the average or the maximum, whereas the correctness of a label can be contributed to by more than one human. In other words, while an average human may achieve 95% correctness and the best human 98%, we can still obtain a set of labels that is close to 100% correct if we put enough resources into perfecting it. For example, we can have each sample labelled by 10 people and take the majority choice as the final label, which largely removes the errors of any single person.
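A minimal sketch of the probability calculation behind that claim, assuming each labeller is correct independently with probability p (an idealizing assumption; the function name and numbers are illustrative, not from the answer):

```python
from math import comb

def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent labellers,
    each correct with probability p, agrees on the correct label.
    Ties (possible when n is even) are conservatively counted as errors."""
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

if __name__ == "__main__":
    # A single labeller is right 95% of the time; a majority of 10 such
    # labellers is right far more often.
    print("single labeller: 0.950000")
    print(f"majority of 10:  {majority_vote_accuracy(0.95, 10):.6f}")
```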
I realized that for some tasks the labels are not created by humans at all, and for those we have true ground truth. When human operators do create the labels, then indeed we can have N operators look at each case independently, so the aggregated labels are better than what most individual humans achieve.
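A small sketch of how the comparison could then be run: score both the model and a single human against the aggregated (consensus) labels. All data, names, and accuracy figures below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_operators = 1000, 5

# Toy "true" labels, unknown in practice.
true_labels = rng.integers(0, 2, size=n_samples)

# Each operator labels every sample independently with 95% accuracy.
flips = rng.random((n_operators, n_samples)) < 0.05
operator_labels = np.where(flips, 1 - true_labels, true_labels)

# Consensus label = majority vote across operators (odd N avoids ties).
consensus = (operator_labels.sum(axis=0) > n_operators / 2).astype(int)

# A toy model that is right 97% of the time, and one individual operator.
model_preds = np.where(rng.random(n_samples) < 0.03, 1 - true_labels, true_labels)
single_human = operator_labels[0]

print("model accuracy vs consensus:", (model_preds == consensus).mean())
print("single human vs consensus:  ", (single_human == consensus).mean())
```

If the model scores higher than the individual operators on these consensus labels, that is evidence it has surpassed single-human performance on the task.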