Thanks for your answer, @paulinpaloalto.
I figured the answer would vary from case to case. The EEG example I mentioned corresponds to the motor imagery problem, in which a person imagines the movement of some part of the body and the job of the model is to classify which part of the body was imagined based on the brain signals.
The data for this problem are usually collected with the following protocol:
(1) A human subject sits down on a chair in front of a screen.
(2) The screen prompts the user to imagine a movement (e.g. the left hand or the right hand).
(3) The EEG signals are recorded for a couple of seconds and are then labeled according to the prompted movement.
This task is possible because the motor cortex behaves in a characteristic manner when we imagine a movement. This behavior can be identified with EEG because it causes a predictable change in the frequency content of the signals.
However, it is usually impractical for a human to classify these signals manually. Every brain is different, so the pattern is not exactly the same for everybody. Plus, as I said, the signal-to-noise ratio of these data is very low. The noise comes from multiple sources, including signals picked up from unrelated parts of the brain and from muscle movement, user fatigue during the experiment (which leads to bad signals or incorrect labels), and even electrical interference from power lines.
Some of the noise can be mitigated with band-pass filters, which remove frequencies that do not belong to brain activity from the signals.
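To make the band-pass step concrete, here is a rough sketch using only NumPy. It is a crude FFT-based filter that zeroes every frequency bin outside the pass band; real pipelines typically use a proper Butterworth or FIR design instead. The function name `bandpass_fft`, the 250 Hz sampling rate, and the 8-30 Hz band (roughly the mu and beta rhythms often used for motor imagery) are my own assumptions for illustration.

```python
import numpy as np

def bandpass_fft(signal, fs, low_hz, high_hz):
    """Crude band-pass: zero every FFT bin outside [low_hz, high_hz]."""
    signal = np.asarray(signal, dtype=float)
    spectrum = np.fft.rfft(signal, axis=-1)
    freqs = np.fft.rfftfreq(signal.shape[-1], d=1.0 / fs)
    keep = (freqs >= low_hz) & (freqs <= high_hz)
    spectrum[..., ~keep] = 0.0
    return np.fft.irfft(spectrum, n=signal.shape[-1], axis=-1)

fs = 250                      # sampling rate in Hz (assumed)
t = np.arange(2 * fs) / fs    # one 2-second trial

mu = np.sin(2 * np.pi * 10 * t)            # 10 Hz rhythm we want to keep
mains = 2.0 * np.sin(2 * np.pi * 50 * t)   # power-line interference
drift = 3.0 * np.sin(2 * np.pi * 0.5 * t)  # slow drift
noisy = mu + mains + drift

filtered = bandpass_fft(noisy, fs, low_hz=8.0, high_hz=30.0)
```

In this toy case the interference sits exactly on FFT bins, so `filtered` matches the 10 Hz component almost perfectly. Real recordings are not that clean, and noise inside the pass band is untouched, which is why filtering only removes some of the noise.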
But, as far as I know from the machine learning literature on motor imagery EEG classification, this is about as far as preprocessing goes. Specifically, most deep learning papers just band-pass filter the signal and leave the rest to a neural network (most commonly a CNN). Some papers also convert the signal to its time-frequency representation instead of directly using the raw signal as input.
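As an illustration of the time-frequency option, the sketch below computes a simple short-time Fourier power spectrogram with NumPy. The helper name `spectrogram`, the Hann window, and the window and hop lengths are assumptions I picked for the example, not settings taken from any particular paper.

```python
import numpy as np

def spectrogram(signal, fs, win_len, hop):
    """Short-time Fourier power; returns (freqs, times, power[freq, frame])."""
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(win_len)
    starts = np.arange(0, len(signal) - win_len + 1, hop)
    # Cut the signal into overlapping, windowed frames.
    frames = np.stack([signal[s:s + win_len] * window for s in starts])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(win_len, d=1.0 / fs)
    times = (starts + win_len / 2) / fs  # frame centers in seconds
    return freqs, times, power.T

fs = 250                     # sampling rate in Hz (assumed)
t = np.arange(4 * fs) / fs   # 4 seconds of signal
# Toy signal: 10 Hz for the first 2 s, 20 Hz afterwards.
sig = np.where(t < 2.0, np.sin(2 * np.pi * 10 * t), np.sin(2 * np.pi * 20 * t))

freqs, times, power = spectrogram(sig, fs, win_len=250, hop=125)
first_peak = freqs[power[:, 0].argmax()]   # dominant frequency in the first frame
last_peak = freqs[power[:, -1].argmax()]   # dominant frequency in the last frame
```

The resulting `power` array is a 2-D "image" (frequency by time), which is the kind of input a CNN can consume in place of the raw signal.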
More traditional approaches include using common spatial patterns to extract features from the signals based on their covariance matrices and then using an LDA classifier. But the preprocessing is similar to what I described above.
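For reference, this is roughly what the CSP plus LDA pipeline looks like for two classes. It is a minimal NumPy sketch on simulated data, not a reference implementation: all function names, the simulated mixing of sources, and the choice of one filter pair are my own assumptions, and the trials are assumed to be zero-mean (for example, already band-pass filtered).

```python
import numpy as np

def fit_csp(trials_a, trials_b, n_pairs=1):
    """CSP filters (rows) from trials shaped (n_trials, n_channels, n_samples)."""
    def mean_cov(trials):
        # Average trace-normalized spatial covariance over trials.
        return np.mean([x @ x.T / np.trace(x @ x.T) for x in trials], axis=0)

    cov_a, cov_b = mean_cov(trials_a), mean_cov(trials_b)
    # Whiten the composite covariance, then diagonalize class A in that space.
    evals, evecs = np.linalg.eigh(cov_a + cov_b)
    whiten = np.diag(evals ** -0.5) @ evecs.T
    _, rot = np.linalg.eigh(whiten @ cov_a @ whiten.T)
    filters = rot.T @ whiten  # rows ordered by increasing class-A variance
    picks = list(range(n_pairs)) + list(range(-n_pairs, 0))
    return filters[picks]     # keep the most discriminative extremes

def csp_features(trials, filters):
    """Log-variance of each spatially filtered trial."""
    projected = np.einsum("fc,ncs->nfs", filters, trials)
    return np.log(projected.var(axis=2))

def fit_lda(feats_a, feats_b):
    """Two-class Fisher LDA with a pooled within-class covariance."""
    mean_a, mean_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    centered = np.vstack([feats_a - mean_a, feats_b - mean_b])
    pooled = centered.T @ centered / (len(centered) - 2)
    weights = np.linalg.solve(pooled, mean_b - mean_a)
    bias = -weights @ (mean_a + mean_b) / 2
    return weights, bias

def predict_lda(feats, weights, bias):
    return (feats @ weights + bias > 0).astype(int)  # 0 = class A, 1 = class B

# Simulated data: each class boosts a different hidden source, and the
# sources are mixed into 4 "electrodes" by a random matrix.
rng = np.random.default_rng(0)
n_channels, n_samples = 4, 250
mixing = rng.normal(size=(n_channels, n_channels))

def simulate(n_trials, strong_source):
    sources = rng.normal(size=(n_trials, n_channels, n_samples))
    sources[:, strong_source] *= 3.0
    return mixing @ sources

train_a, train_b = simulate(40, 0), simulate(40, 1)
filters = fit_csp(train_a, train_b)
weights, bias = fit_lda(csp_features(train_a, filters),
                        csp_features(train_b, filters))

test_x = np.concatenate([simulate(20, 0), simulate(20, 1)])
test_y = np.array([0] * 20 + [1] * 20)
accuracy = (predict_lda(csp_features(test_x, filters), weights, bias) == test_y).mean()
```

The simulated classes are far easier to separate than real EEG, so the accuracy here says nothing about real-world performance. The point is only to show where the covariance matrices and the LDA step enter.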
Mind you, this is an active research problem and my knowledge about the literature is still limited. But I wanted to know if there are some guidelines I can refer to when dealing with the problem I described.
Besides, the motor imagery EEG problem is somewhat uncommon. I’d also be interested in knowing more about what to do in more usual problems, such as the marketing recommendation problem mentioned by Professor Ng.
Could you give me some examples of problems where the machine learning model far surpasses human-level performance and how to decide what to work on next in such scenarios?
That is an interesting method, @Christian_Simonis. But I’m not sure I can ‘measure’ the ground truth label and propagate the error like that in my example (refer to my answer to @paulinpaloalto).