Overall dev set error after fixing incorrectly labeled data

Dear Mentors,

To avoid direct reference to the quiz, I abstracted the question like this:
Overall dev set error: 20%
Errors due to incorrectly labeled data: 5%
Errors due to other reasons: 15%
Question: is it true that if we fix the incorrectly labeled data, we will reduce the overall dev set error to 15%?
I think the answer should be True, because fixing labels is different from fixing other causes (image quality, etc.). In this case no error can be due to both mislabeled data AND other reasons; otherwise the overall dev set error would not equal the sum of the individual categories. The hint says the 5% is an estimate of a “ceiling”, but in my opinion, fixing the labels in the dev set guarantees that those 5% are reduced to 0%, leaving an overall error of 15%.
What am I missing here?

Even if all the labels are correct, you will still be subject to the “other reason” errors at 15% of all of the examples.

So removing the 5% of bad labels only improves the results by 85% of those 5%.
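
The arithmetic behind that statement can be sketched as follows. This is only an illustration, assuming the relabeled examples are hit by "other reason" errors at the same 15% rate as the rest of the dev set; the function name `error_after_label_fix` is hypothetical and not from the course.

```python
def error_after_label_fix(overall_err, label_err, other_err):
    """Estimate the dev set error after correcting the bad labels.

    Assumption (not stated in the quiz): once relabeled, the examples
    are still misclassified for "other reasons" at the rate other_err.
    """
    # Only the share of relabeled examples NOT hit by other errors improves.
    improvement = label_err * (1.0 - other_err)
    return overall_err - improvement


new_err = error_after_label_fix(overall_err=0.20, label_err=0.05, other_err=0.15)
print(f"{new_err:.2%}")  # 15.75%
```

Under that assumption the 5% acts as an upper bound on the improvement, so the error lands slightly above 15% rather than exactly on it.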

Hi TMosh,

Thanks for the quick feedback. I think you mean that, among the 5% of errors attributed to mislabeling, about 15% are also caused by other errors, i.e. some misclassified samples are due to both mislabeling and other reasons. That makes sense.

However, the question is formulated so that the sum of the individual error categories equals the overall error. This indicates that the categories do not overlap, i.e. every misclassified sample is caused by exactly one type of error; otherwise the overall error would be smaller than the sum of the categories.
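
The counting argument in this post can be checked with a small sketch. It assumes a hypothetical dev set of 100 examples in which each error category is a set of example indices; the set contents are made up purely for illustration.

```python
def overlap_count(errors_a, errors_b):
    """Number of examples counted in both categories (inclusion-exclusion)."""
    return len(errors_a) + len(errors_b) - len(errors_a | errors_b)


# Hypothetical case matching the question: 5 + 15 = 20 misclassified examples.
mislabeled = set(range(0, 5))    # errors attributed to bad labels
other = set(range(5, 20))        # errors attributed to other reasons
print(len(mislabeled | other), overlap_count(mislabeled, other))  # 20 0

# Hypothetical case with overlap: the union is smaller than the sum.
other_overlapping = set(range(3, 18))
print(len(mislabeled | other_overlapping),
      overlap_count(mislabeled, other_overlapping))  # 18 2
```

This only shows that the categories do not overlap on the current predictions. It says nothing about whether the model's prediction on a relabeled example will match the corrected label.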

You are assuming more than the question contains.

All right, thank you.