How should the reference label be determined in such cases? 🤔

Dear all,
Assume that we have a medical image dataset.
The labels (malignant or benign) were determined by experts examining the medical images. However, some labels come out differently after a biopsy check (these are determined by the same expert after reviewing the biopsy report).
For machine learning in medical imaging diagnosis, which label should be treated as Human-Level Performance (HLP), and which as the reference, given that all labels were defined by an expert?

Thanks in advance

Interesting. I would imagine that in practice you need to record both of them, but when it comes to the model, you use the latter one, since human-level performance doesn't say you can't use "tools" (in this case, imaging).
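
As a rough illustration of why recording both is useful, here is a minimal Python sketch of one possible reading: treat the biopsy-confirmed label as the reference, and measure how often the image-only expert label agrees with it. The label lists, the model predictions, and the `accuracy` helper are all made up for illustration and do not come from any real dataset.

```python
# Hypothetical illustration data (not real): 1 = malignant, 0 = benign.
image_only_labels = [1, 0, 1, 1, 0, 0, 1, 0]  # expert reading the image alone
biopsy_labels     = [1, 0, 0, 1, 0, 1, 1, 0]  # same expert after the biopsy report
model_predictions = [1, 0, 0, 1, 0, 0, 1, 0]  # assumed output of some model


def accuracy(predicted, reference):
    """Fraction of cases where `predicted` matches `reference`."""
    if len(predicted) != len(reference):
        raise ValueError("label lists must have the same length")
    if not reference:
        raise ValueError("label lists must not be empty")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Assumption: the biopsy-confirmed label is the reference for both numbers.
hlp = accuracy(image_only_labels, biopsy_labels)
model_acc = accuracy(model_predictions, biopsy_labels)

print(f"Image-only expert vs. biopsy: {hlp:.3f}")
print(f"Model vs. biopsy:             {model_acc:.3f}")
```

With both labels stored per case, the two numbers are directly comparable because they are scored against the same reference.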

Thanks for your reply,
So for our purpose, diagnosis from medical images, could we consider both references as ground truth?
The first would be the HLP accuracy target, which comes from an expert examining the medical images, and the second would be the label that the same expert assigns after reviewing the biopsy.