What to do when there is no human-level performance baseline?


Throughout the course, we’re taught how to assess our models and decide how to improve them next. This decision usually involves evaluating how much of the error is due to “avoidable bias”, “variance”, and “distribution mismatch”.

But, as Professor Ng says in the “Surpassing Human-level Performance” lecture, “some of the tools you have for pointing you in a clear direction just don’t work as well” when the performance of our models surpasses the human baseline, which makes progressing more difficult. However, he doesn’t talk much about what to do in these cases.

I believe this problem also occurs when we’re dealing with problems where there is no human baseline. For example, I work with the classification of electroencephalogram (EEG) data that usually have low signal-to-noise ratios (and no human baseline, as this is not a natural perception problem). In this scenario, one of the problems is that it’s difficult to tell how much of the classification error is due to intrinsic noise and how much of it is because of the model.

So, what to do when we don’t have a human baseline? Is there another way of estimating the Bayes error? What are some other ways of effectively deciding what to work on next in this scenario?

From what I remember of what Prof Ng says about this, I think he’s mainly talking about the case in which human performance happens to be much lower than what an ML model can do. In those cases, the comparison between Human Error and the performance of the model just isn’t very useful or relevant. My guess is that the reason he doesn’t say anything very specific about those cases is that it is very “situational”. E.g. in your EEG example, it would be useful to know more details. What do you mean that there is no human metric? How do you know if a particular classification is correct or not? How is that evaluation done in the case that you don’t have an ML model? There must be some process and metrics. What do those look like? Where do you get your “labelled data” if you want to train an ML model? If the signal to noise ratio is low, then you’re starting with a pretty fundamental problem, right? Maybe you need to invest more effort in the signal processing side of things before you graduate to the ML approach. Mind you I’ve never studied signal processing, but only know that there are deep waters there (I was a math grad student once upon a time, so I can spell Fourier but never actually got to any SP applications).

Another way of estimating the Bayes error, e.g. if the ground truth is measured in a technical, objective way, is:

  • taking the specification of the sensor and propagating its error or uncertainty through the label definition
  • verifying the propagated error with reference measurements, e.g. against a laboratory standard, using a representative sample size (e.g. n > 30 in the case of a normal distribution). In general, maximum likelihood estimation can be used to determine the parameters of the distribution. (If n <= 30, we can still fit a Student’s t-distribution if some assumptions hold, such as symmetry of the distribution.)

So the question would be: how do you measure the ground truth label in the relevant application, the electroencephalogram (EEG)? I guess we would need the specs of the electrodes here, and we would need to understand the measurement principle (direct vs. indirect) to propagate the error correctly.


Thanks for your answer, @paulinpaloalto.

I figured the answer would vary from case to case. The EEG example I mentioned corresponds to the motor imagery problem, in which a person imagines the movement of some part of the body and the job of the model is to classify which part of the body was imagined based on the brain signals.

The data for this problem are usually collected with the following protocol:
(1) A human subject sits down on a chair in front of a screen.
(2) The screen prompts the user to imagine a movement (e.g. the left hand or the right hand).
(3) The EEG signals are recorded for a couple of seconds and are then labeled according to the prompted movement.

This task is possible because the motor cortex behaves in a characteristic manner when we imagine a movement. This behavior can be identified with EEG, as it causes a predictable change in the frequencies of the signals.

However, it is usually impractical for a human to manually classify these signals. Every brain is different, so the pattern is not the exact same for everybody. Plus, as I said, the signal-to-noise ratio of these data is very low. The noise comes from multiple sources, including signal readings from unrelated parts of the brain and from muscle movement, user fatigue during the experiment (which leads to bad signals or incorrect labels), and even electrical interference from powerlines.

Some of the noise can be mitigated using band-pass filters, which remove from the signals frequencies that do not belong to brain activity.
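
To make the band-pass idea concrete, here is a minimal sketch in plain Python. It is only an illustration under assumptions that are not from this thread: a 200 Hz sampling rate, an 8-30 Hz pass band, a synthetic 10 Hz component standing in for brain activity, and 50 Hz powerline interference. It also uses a crude "zero the unwanted DFT bins" filter, whereas real pipelines would normally use a properly designed filter from a signal processing library.

```python
import cmath
import math


def dft_bandpass(signal, fs, low_hz, high_hz):
    """Keep only frequency components with low_hz <= |f| <= high_hz."""
    n = len(signal)
    # Forward DFT (naive O(n^2); fine for a short illustration).
    spectrum = [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Zero every bin whose frequency lies outside the pass band.
    for k in range(n):
        # Bins above n/2 represent negative frequencies.
        freq = (k if k <= n // 2 else n - k) * fs / n
        if not (low_hz <= freq <= high_hz):
            spectrum[k] = 0
    # Inverse DFT; the imaginary part is numerical noise for real input.
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


fs = 200  # sampling rate in Hz (assumed for this example)
t = [i / fs for i in range(fs)]  # one second of samples
brain_like = [math.sin(2 * math.pi * 10 * x) for x in t]       # synthetic 10 Hz component
powerline = [0.8 * math.sin(2 * math.pi * 50 * x) for x in t]  # synthetic 50 Hz interference
noisy = [a + b for a, b in zip(brain_like, powerline)]

filtered = dft_bandpass(noisy, fs, low_hz=8, high_hz=30)
# Largest difference between the filtered signal and the clean 10 Hz component.
max_residual = max(abs(f - b) for f, b in zip(filtered, brain_like))
```

In this toy case the 50 Hz component falls outside the pass band and is removed, while the 10 Hz component passes through. Noise that overlaps the band of interest (e.g. muscle activity) is not removed this way, which is part of why the problem stays hard.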

But, from my knowledge of the machine learning literature in motor imagery EEG classification, this is about as far as preprocessing goes. Specifically, most deep learning papers just band-pass filter the signal and leave the rest to a neural network (most commonly a CNN). Some papers also convert the signal to its time-frequency representation instead of directly using the raw signal as input.

More traditional approaches include using common spatial patterns to extract features from the signals based on their covariance matrices and then using an LDA classifier. But the preprocessing is similar to what I described above.

Mind you, this is an active research problem and my knowledge about the literature is still limited. But I wanted to know if there are some guidelines I can refer to when dealing with the problem I described.

Besides, the motor imagery EEG problem is somewhat uncommon. I’d also be interested in knowing more about what to do in more usual problems, such as the marketing recommendation problem mentioned by Professor Ng.

Could you give me some examples of problems where the machine learning model far surpasses human-level performance and how to decide what to work on next in such scenarios?

That is an interesting method, @Christian_Simonis. But I’m not sure I can ‘measure’ the ground truth label and propagate the error like that in my example (refer to my answer to @paulinpaloalto).


Hi, Gabriel.

Thanks so much for your detailed background information about the motor imagery problem. I should first give the disclaimer that I am just a fellow student, not a real “practitioner of the art”: I don’t have real world experience with ML and all I know is what I’ve heard Prof Ng discuss in these various courses. So take what I say with the appropriate dosage of salt :nerd_face: and skepticism …

I do recall that Prof Ng called out the marketing recommendation problem as one good example of a case in which algorithms seriously outperform humans. But here again, I think how you decide what to do to improve performance is going to be pretty “situational”, and it’s not really clear how much will generalize to your EEG problem. In that recommender system case, my guess would be that there is no good way to know what the Bayes Error actually is and you just keep trying to improve performance. I think I remember hearing other people posit this case as one in which “online learning” is really effective: the idea is that you are continuously getting feedback from the responses of users (do they click or not, and if so on what), so you can create a feedback loop that incorporates new data into the training set and retrains incrementally on the fly (sort of a variation on Transfer Learning).

But with the EEG motor imagery problem, I think we’re getting ahead of ourselves here. I think you’ve got some much more fundamental questions that you need to solve first. If I’m correctly understanding your problem description above, then to me it sounds like the two top level problems are:

  1. The data is unique to each person.
  2. The data is very noisy (low signal to noise ratio).

ML is good at finding patterns in data and can even in some extreme cases recognize things that an expert human can’t see. One famous such example is the case in which a Google AI model learned to determine the sex of a patient from examining retinal scans, which ophthalmologists previously thought was not possible. But if the data in the motor imagery case is different for every experimental subject, then that strikes me as a pretty fundamental problem: how can it learn a generally applicable solution if every person’s signals are different? Someone once asked on the earlier version of the forums whether they could train an ML model to predict the stall speed and angle of attack of an airfoil in a wind tunnel. But the point is that an ML model only knows what it can learn from the data that it has seen: if you feed it an airfoil with a shape it’s never been trained on, then the answer is not likely to be useful. So the high level principle is that if you’re trying to solve a physics problem, you need a physics approach. So invest your effort in fine-grained aerodynamic simulation software, rather than training an ML model.

Mind you, I’m not saying that analogy necessarily is relevant to the EEG motor imagery case. But it seems that you need some approach to “normalize” the inputs in a way that generalizes. One thought that occurs to me is that you want to have a baseline for how each subject’s brain works in general and then somehow compute the differences between the baseline responses and the motor imagery responses. E.g. start with a baseline training set that tries to sample how the person’s brain works in general. Give them other prompts and sample the EEGs:

  1. Imagine your mother’s face
  2. Imagine the taste of peanut butter
  3. Imagine the sound of a dog barking
  4. Imagine the sound of one hand clapping
  5. What is 2 + 3?
  6. There’s a tiger behind that bush!
  7. What are you going to have for breakfast tomorrow?

And so forth … Then you have a general model for how each person’s brain is wired and you can somehow use that to compute the “difference” with the responses to the motor imagery prompts.

Of course you also then need some signal processing on the EEG outputs in general (band pass filters, Fourier Transforms) to deal with the noise issues.

Some kind of “embedding model” seems like a plausible approach. If you leave the inputs in the time sequence domain, then you’ll need it to be a sequence model (Course 5 here). The “embedding” idea is covered in the Face Recognition section of Course 4 Week 4. The model can learn its own idea of what the elements of the embedding are by being trained on the various (normalized) signals.


That’s very impressive, Paul sir!

Thanks for your thoughtful response, @paulinpaloalto.

Regarding the first part of your answer, at the end of the day, it seems that the way out is to apply domain knowledge to identify and fix the shortcomings of the model in solving your problem.

Regarding the second part of your answer, let me first restate the current state of motor imagery EEG classification. In real-life applications such as motor rehabilitation, what usually happens is that the user goes through a calibration phase, in which motor imagery EEG data from the user is collected to train the model. After the model is trained, we can successfully use the motor imagery EEG classifier. And it works quite well for some subjects, who reach 80-90% accuracy in left/right-hand classification. Other subjects have very bad results, around 50-60%, which is barely above the 50% chance level for a two-class problem (in fact, my original question arises from this particular situation).
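
As a rough sanity check on the low-accuracy subjects, one option is to ask whether a given accuracy can even be distinguished from guessing. The sketch below runs a one-sided exact binomial test. The trial count of 100 and the counts of correct trials are made-up numbers for illustration, and the test assumes balanced classes and independent trials, which may not hold for real EEG sessions.

```python
from math import comb


def p_value_above_chance(correct, trials, chance=0.5):
    """One-sided exact binomial test: P(X >= correct) if the classifier only guesses."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


trials = 100  # hypothetical number of test trials for one subject
p_values = {
    correct: p_value_above_chance(correct, trials)
    for correct in (55, 60, 85)  # hypothetical numbers of correct trials
}

for correct, p in p_values.items():
    print(f"{correct}/{trials} correct: p = {p:.4g}")
```

A large p-value means the result is compatible with pure guessing at that sample size. In that case it is hard to say anything about how the remaining error splits between intrinsic noise and the model, and collecting more test trials may be the first step.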

From here, we have two active research challenges: first, increasing classification accuracy overall; second, reducing or eliminating this calibration step.

Regarding the last part of your answer, I’d say your suggestions are on point. In fact, the approach you proposed is quite similar to the one I’m currently investigating in my research. That is awesome :).

To conclude this post, I’ll just mention that there’s some interesting research in transfer learning and domain adaptation approaches for motor imagery EEG to address exactly the problem you mentioned, of “normalizing” the inputs in a way that generalizes. The approaches in the literature range from divergence-based methods (like maximum mean discrepancy) to adversarial methods (like conditional domain-adversarial networks). But this topic is still underexplored, so there is a lot more to try!
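
For readers who haven’t met divergence-based methods, here is a minimal sketch of the maximum mean discrepancy (MMD), the usual name for the divergence mentioned above, computed between feature sets from two subjects. The 2-D feature values, the RBF kernel, and `gamma=1.0` are all assumptions made up for illustration. In domain adaptation this quantity is typically used as a loss term to be minimized, which is not shown here.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of the squared MMD between two samples."""

    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(source, source)
        + mean_kernel(target, target)
        - 2 * mean_kernel(source, target)
    )


# Hypothetical 2-D features extracted from two subjects (made-up numbers).
subject_a = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (0.1, 0.2)]
subject_b = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (1.1, 1.2)]

same = mmd_squared(subject_a, subject_a)       # identical samples
different = mmd_squared(subject_a, subject_b)  # shifted distribution
```

The estimate is zero for identical samples and grows as the two feature distributions move apart, which is what makes it usable as a measure of how different two subjects’ data are.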


I wish you the best with your research! It sounds like an area where DL techniques will be useful, but there is obviously a lot of hard work to get from the current state of the art to real solutions. Please keep us posted and send us a link when you publish the paper!


I have a question regarding human-level performance. How do you get the human-level error rate? Usually annotators label the data, which is then split into train/dev/test sets. The evaluation is focused on the algorithm’s performance. So one way to get the human-level error rate is possibly through this procedure:

  1. Data annotation by 1 or more annotators
  2. Another group of people (for example, domain experts) re-annotates the annotated data, and the error rate is calculated against the experts’ re-annotations.
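
The two steps above could be sketched as follows. The labels are made up for illustration, and the sketch assumes the expert re-annotation is treated as the ground truth, which is itself an approximation.

```python
def error_rate(labels, reference):
    """Fraction of items where a labeler disagrees with the reference labels."""
    if len(labels) != len(reference):
        raise ValueError("label lists must have the same length")
    wrong = sum(1 for a, b in zip(labels, reference) if a != b)
    return wrong / len(reference)


# Made-up labels for ten items; the expert re-annotation is treated as truth.
expert      = ["cat", "dog", "dog", "cat", "cat", "dog", "cat", "dog", "dog", "cat"]
annotator_1 = ["cat", "dog", "cat", "cat", "cat", "dog", "cat", "dog", "dog", "cat"]
annotator_2 = ["cat", "dog", "dog", "cat", "dog", "dog", "cat", "cat", "dog", "cat"]

human_error = {
    "annotator_1": error_rate(annotator_1, expert),
    "annotator_2": error_rate(annotator_2, expert),
}
```

With real data you would want far more than ten items, since the estimate of a small error rate is very noisy on a small sample.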

This would add more cost to the project. And for most open source ML projects or ML competitions, train/dev/test datasets are more than enough to evaluate the systems. So I am still not quite clear why human-level error rate is that important, because the assumption is that human error is close to 0%.

Is this true?


The procedure you describe for estimating human error is pretty similar to what we learned in Prof Ng’s discussions on this topic. In one of the big examples he gave, the topic was medical imaging. If I recall correctly, he points out that you could have the images labelled by three different humans or groups of humans:

  1. Individual doctors who are not specialty radiologists.
  2. Individual radiologists.
  3. A team of expert radiologists.

You would expect to get the highest accuracy from a team of experts analyzing the images of course.

Human error is not necessarily close to 0%: it all depends on the task. Prof Ng discusses this in the lectures in this section. There are some things the human visual system is very good at (e.g. distinguishing whether an animal in a picture is a cat or a dog), but he gave a number of examples in which human error is pretty high. I think “recommender systems” was the main example he used of that. Whether human error is important or not is “situational”, as we discussed earlier on this thread. In the medical imaging case, you are talking about people’s health, so a wrong judgement can have serious consequences. If the proposal is to take the trained radiologist out of the loop and trust your ML system to analyze CAT scans for tumors, you’d better be able to prove strongly that your system is at least as good as humans at the same task. If the purpose of your system is just to suggest which products a web surfer is interested in buying, then human error is not that relevant.

The other point that Prof Ng makes in the lectures is that Human Error is also used as a proxy for Bayes Error, which (by definition) is the lowest possible error that is achievable on a given task. Of course that is a theoretical concept and there is not usually a way to prove what that value is. But you know (also by definition) that it is <= human error. That is something that (with some effort and expense) you can sample, which will give you an upper bound for Bayes Error. If you are successful enough that your system performs better than human error, then you will need to decide a) whether your current performance is good enough and b) whether you can think of ways that might potentially improve the performance further and at a cost you can afford.
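
Here is a small sketch of how that proxy is used in the course’s error analysis: avoidable bias is the gap between training error and human error, and variance is the gap between dev error and training error. The function name, the decision rule, and the error values are my own illustrative choices, not something from the lectures.

```python
def diagnose(human_error, train_error, dev_error):
    """Split error into avoidable bias and variance, using human error as a Bayes proxy."""
    avoidable_bias = train_error - human_error
    variance = dev_error - train_error
    if avoidable_bias < 0:
        # The model beats the human proxy, so the proxy no longer bounds what is achievable.
        focus = "unclear"
    elif avoidable_bias >= variance:
        focus = "bias"
    else:
        focus = "variance"
    return avoidable_bias, variance, focus


# Hypothetical error rates.
below_human = diagnose(human_error=0.05, train_error=0.08, dev_error=0.10)
above_human = diagnose(human_error=0.05, train_error=0.03, dev_error=0.06)
```

The second call shows the situation from the start of this thread: once training error is below human error, the avoidable bias comes out negative and the comparison stops telling you which direction to work in.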

There may be some situations where the human error rate needs to be found out, like medical applications. For some areas I know of, there is no such step to explicitly calculate a human error, because this would add more burden to data production, especially in a dynamic environment. As long as annotator agreement reaches a certain level, e.g. 75%, the annotation can be trusted.
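
A minimal sketch of such an agreement check between two annotators is below. The labels are made up. Besides the raw percent agreement, it also computes Cohen’s kappa, which is an addition of mine rather than something from the post above: it corrects the raw agreement for the agreement you would expect by chance.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two annotators gave the same label."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def cohens_kappa(a, b):
    """Chance-corrected agreement; undefined if expected agreement equals 1."""
    n = len(a)
    observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Agreement expected if both annotators labeled at random with their own label rates.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Made-up binary labels from two annotators on eight items.
annotator_1 = [1, 1, 1, 1, 0, 0, 0, 0]
annotator_2 = [1, 1, 1, 0, 0, 0, 0, 1]

agreement = percent_agreement(annotator_1, annotator_2)
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this example the annotators agree on 6 of 8 items, so the raw agreement meets a 75% threshold, but with two balanced classes half of that agreement is expected by chance, which is why the kappa value comes out much lower.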
