Bayes error, human-level performance and overfitting (structured data)

During the lecture, Bayes error, human-level performance (which is considered a proxy for Bayes error), and overfitting were mentioned. These concepts are clear for unstructured data, but not so clear for structured data. I have some questions below:

  • How can we identify the Bayes error or human-level performance for structured data? (Maybe there is a way that I am not aware of.)
  • If we don't have the Bayes error or human-level performance for structured data, how can we decide whether to improve avoidable bias or variance after training the model?
  • Overfitting question. During the lecture, the following example is given (assuming an optimal Bayes error of about 0%):

    |                 | Case 1        | Case 2    | Case 3                    | Case 4                  |
    |-----------------|---------------|-----------|---------------------------|-------------------------|
    | Train set error | 1%            | 15%       | 15%                       | 0.5%                    |
    | Dev set error   | 11%           | 16%       | 30%                       | 1%                      |
    | Diagnosis       | High variance | High bias | High bias & high variance | Low bias & low variance |

The outcomes themselves are clear, but it is not clear how large a difference between the train set error and the optimal Bayes error should be considered significant. If the train set error is 5%, should we still consider it high bias, or is it acceptable?

Hey @Qiong_Wu1,
Welcome to the community. For your first question, consider the example that Prof Andrew explained in the lecture. When the model outperforms the human-level performance, it becomes harder to further improve the model since there is no human baseline as a reference. The same holds true for structured data. If you can’t obtain the human-level performance for some structured data, then it will be harder to improve the model (but not impossible).

Keeping that in mind, in some cases it might be possible to obtain human-level performance for structured data as well. For instance, an experienced analyst can look at a dataset consisting of tens of factors about different ad campaigns and give each campaign a relative score reflecting its performance. We can then compare these relative scores with the campaigns' actual performance to find the human-level error. That said, I haven't come across any systematic way to find the human-level error for structured data so far, but if one exists, you can look for it on Google Scholar.
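To make the analyst idea concrete, here is a minimal sketch. All the data below is invented purely for illustration: we pretend an analyst labelled ten campaigns as successful (1) or not (0), and we compare those calls against the known outcomes to estimate a human-level error.

```python
# Hypothetical example: estimating human-level error on structured data by
# comparing an analyst's judgements against known outcomes.
analyst_labels = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]  # analyst's calls (invented)
true_labels    = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]  # actual outcomes (invented)

# Human-level error = fraction of cases the analyst got wrong.
errors = sum(a != t for a, t in zip(analyst_labels, true_labels))
human_level_error = errors / len(true_labels)
print(f"Estimated human-level error: {human_level_error:.0%}")  # -> 20%
```

This estimate can then play the same role a human baseline plays for unstructured data, e.g. for computing avoidable bias.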

As for your second question, it is the same scenario as when a model outperforms human-level performance on unstructured data. In that case we can't calculate the avoidable bias either, so whatever methods/strategies you would apply in the case of unstructured data can also be applied in the case of structured data.

A train set error (TSE) of 5% can be considered high bias, high variance, both, or even acceptable, depending on the avoidable bias (AB) and the dev set error (DSE); i.e., its classification is relative. Consider the cases below:

  • TSE = 5%, DSE = 10%, AB = 1% | High variance
  • TSE = 5%, DSE = 6%, AB = 4% | High bias
  • TSE = 5%, DSE = 10%, AB = 4% | High bias & high variance
  • TSE = 5%, DSE = 6%, AB = 1% | Acceptable, but may be improved further depending on the application

So, there is no fixed threshold like 2% or 5% at which you would consider the TSE to indicate high bias, high variance, both, or be acceptable. It all depends on your requirements, your model's performance on the train/dev/test sets, and the comparison with human-level performance (if it exists). I hope this helps.
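The reasoning above can be sketched as a small helper. Note that the 2% threshold for "high" is an assumption chosen only for illustration; as stated, the right cutoff depends entirely on your application. Avoidable bias here is the train-set error minus the human-level error, and variance is the dev-set error minus the train-set error.

```python
def diagnose(train_err, dev_err, human_err, threshold=0.02):
    """Classify a model given train/dev errors and a human-level baseline.

    The `threshold` for calling a gap "high" is an illustrative assumption.
    """
    avoidable_bias = train_err - human_err   # gap between train error and baseline
    variance = dev_err - train_err           # gap between dev and train error
    high_bias = avoidable_bias > threshold
    high_variance = variance > threshold
    if high_bias and high_variance:
        return "High bias & high variance"
    if high_bias:
        return "High bias"
    if high_variance:
        return "High variance"
    return "Acceptable"

# TSE = 5% in every case; the diagnosis changes with DSE and the baseline.
print(diagnose(0.05, 0.10, 0.04))  # AB = 1%, variance = 5% -> High variance
print(diagnose(0.05, 0.06, 0.01))  # AB = 4%, variance = 1% -> High bias
print(diagnose(0.05, 0.10, 0.01))  # AB = 4%, variance = 5% -> both
print(diagnose(0.05, 0.06, 0.04))  # AB = 1%, variance = 1% -> Acceptable
```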

Regards,
Elemento

Dear Elemento,

Thank you for your quick response. It becomes easier if we have human-level performance. However, in some cases, such as fraud detection, even humans do not have experience; that's why we want to use a machine learning algorithm to help us. If human-level performance does not exist, do you have any suggestions for checking model performance and detecting overfitting?

Hey @Qiong_Wu1,
In my opinion, a good way to guide our analysis when we don't have any human-level performance is to use an estimate of what we want our model to achieve. Suppose we have a fraud detection system and we want it to achieve an error of 5%. Based on our train-set and dev-set errors, we can then determine whether our next efforts should go toward decreasing bias, variance, or both. This desired level is something that should be set very carefully, most probably by someone with considerable domain experience.

Still, I would like to request @kenb, @TMosh, and @paulinpaloalto to take a look at this and suggest any better methods they may have come across in their experience.

I hope this helps.

Regards,
Elemento

Hi Elemento,

Thanks. I think the difficulty is estimating an error, especially for structured data. How can we guarantee the estimated error is realistic or statistically reasonable, since we don't even know what error to expect?

Hey @Qiong_Wu1,
I definitely agree with your point that it's difficult to guarantee the estimated error is realistic or statistically reasonable. However, let's say you are an ML engineer building a fraud detection system for a company. Suppose the company has a non-AI-based fraud detection system that gives them an error of 10%, and they want you to build an AI system with an error of 5% at most.

Now, from the company's perspective, they won't spend lots of dollars just to get a system that improves the error by only 1-2%; hence, it's your job as an ML engineer to deliver a model with an error rate of 5%, otherwise the company will go elsewhere.

However, from your perspective and experience, an error rate of 5% might not be realistic, so you both agree on an error rate of 6.5%. Whether this is achievable or not, you can only find out by building the system.

Note that in this case, the company won't care whether it has structured or unstructured data. If AI can't prove to be better than its current system, it simply won't give your company the contract to build one.

This is one of the key differences between traditional software and AI-based software. With traditional software, one can almost always deliver up to expectations, but with AI-based software, the expectations are hard to define, and even when they are defined, there is always a great deal of uncertainty about whether they will be met.

To further consolidate my point, I would like to cite something from the latest edition of The Batch:

The Batch, May 18, 2022

Compared to traditional software that begins with a specification and ends with a deliverable to match, machine learning systems present a variety of unique challenges. These challenges can affect the budget, schedule, and capabilities of a product in unexpected ways.

How can you avoid surprising customers? Here’s a non-exhaustive checklist of ways that a machine learning system might surprise customers who are more familiar with traditional software:

  • We don’t know how accurate the system will be in advance.
  • We might need a costly initial data collection phase.
  • After getting the initial dataset, we might come back and ask for more data or better data.
  • Moreover, we might ask for this over and over.
  • After we’ve built a prototype that runs accurately in the lab, it might not run as well in production because of data drift or concept drift.
  • Even after we’ve built an accurate production system, its performance might get worse over time for no obvious reason. We might need help monitoring the system and, if its performance degrades over time, invest further to fix it.
  • A system might exhibit biases that are hard to detect.
  • It might be hard to figure out why the system gave a particular output. We didn’t explicitly program it to do that!
  • Despite the customer’s generous budget, we probably won’t achieve AGI.

That’s a lot of potential surprises! It’s best to set expectations with customers clearly before starting a project and keep reminding them throughout the process.

I hope this helps.

Regards,
Elemento

Hi Elemento,

Thanks. It is a nice discussion. :smile:


In addition to using human-level performance or a previous model within the company as a reference, you can also use vendor models, or models used by competitors in the industry, as references.

If it's an entirely new industry, you may also estimate the baseline performance required to reach a break-even point; it's like saying this business will only make money if we can get at least X% accuracy.
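That break-even idea can be sketched with some simple arithmetic. All the dollar figures below are invented for illustration: suppose each correct decision earns the business a value, each error costs it something, and the system has a fixed running cost. Setting profit to zero and solving for accuracy gives the minimum accuracy X% at which the system pays for itself.

```python
def breakeven_accuracy(value_per_correct, cost_per_error, fixed_cost, n_cases):
    """Minimum accuracy at which the system breaks even.

    Derived from: profit(acc) = n*acc*value - n*(1-acc)*cost - fixed = 0,
    which solves to acc = (fixed/n + cost) / (value + cost).
    """
    return (fixed_cost / n_cases + cost_per_error) / (value_per_correct + cost_per_error)

# Invented numbers: $10 earned per correct call, $40 lost per error,
# $100k fixed cost, 50k cases handled per period.
acc = breakeven_accuracy(value_per_correct=10.0, cost_per_error=40.0,
                         fixed_cost=100_000.0, n_cases=50_000)
print(f"Break-even accuracy: {acc:.0%}")  # -> 84%
```

Below this accuracy the system loses money regardless of how it compares to any human or vendor baseline, which makes it a useful floor when no other reference exists.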