Extra info about week 2 - Cleaning up incorrectly labeled data

Hello everybody,
I would like to understand in depth how double-checking only the examples that the algorithm got wrong may result in higher bias.
Thanks in advance,
Kind regards,

Hello @S.hejazinezhad,

Did you mean that correcting wrong labels will increase bias? What is the name of the video where you saw this?


Hello Raymond,
I wrote the lecture video's name in the topic title. It is "Cleaning up incorrectly labeled data". It says that if you only fix the ones that your algorithm got wrong, you end up with a more biased estimate of the error of your algorithm.
I would be glad if you could explain how.
Kind regards,

Hello @S.hejazinezhad!

Thanks for pointing out the title :smiley:. How silly of me…

And thanks for the quote, since it's pretty much the center of my following explanation.

I guess your first post is suggesting that correcting labels for the wrongly predicted samples will result in "a model with higher bias". However, I think what the video is saying is that correcting them will result in "a more biased estimate of the error". The difference is clear: the former is about the bias of the model, whereas the latter is about the error estimate of the model.

Let’s look at these example data:

| id | truth | label | predict |
|----|-------|-------|---------|
| 0  | 1     | 1     | 1       |
| 1  | 0     | 0     | 0       |
| 2  | 0     | 1     | 1       |
| 3  | 1     | 0     | 0       |
| 4  | 1     | 1     | 0       |
| 5  | 1     | 1     | 0       |
| 6  | 1     | 0     | 1       |
| 7  | 0     | 1     | 0       |

Now, the error rate of our model is 4/8 (nos. 4, 5, 6, and 7 are wrong predictions with respect to the labels).

The true error rate is also 4/8 (nos. 2, 3, 4, and 5 are wrong predictions w.r.t. the truth).

If we correct the labels for the wrongly predicted samples, then nos. 6 and 7 are corrected, and in that case, the model's error rate will become 2/8 (only 4 and 5 are wrong predictions w.r.t. the labels).

Here, the error estimate is more biased because it is 2/8 while the true error rate is 4/8.
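We can check that arithmetic with a short sketch (the data are just the eight rows of the table; `truth`, `label`, and `predict` mirror the column names):

```python
# The eight example rows as (truth, label, predict) tuples
rows = [
    (1, 1, 1),  # id 0
    (0, 0, 0),  # id 1
    (0, 1, 1),  # id 2
    (1, 0, 0),  # id 3
    (1, 1, 0),  # id 4
    (1, 1, 0),  # id 5
    (1, 0, 1),  # id 6
    (0, 1, 0),  # id 7
]

def error_rate(pairs):
    """Fraction of (reference, prediction) pairs that disagree."""
    pairs = list(pairs)
    return sum(a != b for a, b in pairs) / len(pairs)

# Error rate measured against the (possibly wrong) labels: 4/8
print(error_rate((l, p) for t, l, p in rows))  # 0.5

# True error rate measured against the truth: also 4/8
print(error_rate((t, p) for t, l, p in rows))  # 0.5

# Correct labels ONLY where the prediction disagrees with the label (ids 6 and 7)
corrected = [(t, t if p != l else l, p) for t, l, p in rows]

# The measured error rate drops to 2/8, below the true 4/8
print(error_rate((l, p) for t, l, p in corrected))  # 0.25
```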

If wrong labeling is a random process, then there should be an equal portion of wrong labels among both the correctly and the wrongly predicted samples. Correcting labels only on the wrong side can ONLY reduce the error rate, which is a biased act, because correcting labels on the right side can increase the error rate, and we are not doing that.
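This one-sided effect shows up in a quick simulation too (a sketch; the 10% label-noise rate and 20% model-error rate are arbitrary assumptions I picked for illustration):

```python
import random

random.seed(0)
n = 100_000

truth = [random.randint(0, 1) for _ in range(n)]
# Random (unbiased) mislabeling: flip 10% of labels, independently of everything else
labels = [1 - t if random.random() < 0.1 else t for t in truth]
# A model that predicts the truth correctly 80% of the time, with independent errors
preds = [1 - t if random.random() < 0.2 else t for t in truth]

def err(ref, pred):
    """Error rate of pred measured against the reference labels."""
    return sum(r != p for r, p in zip(ref, pred)) / len(ref)

# Correct labels ONLY where the model disagrees with the label (the biased act)
one_sided = [t if p != l else l for t, l, p in zip(truth, labels, preds)]
# Correct ALL wrong labels (the unbiased act): the labels become the truth
full = list(truth)

print(err(truth, preds))      # true error rate, ~0.20
print(err(one_sided, preds))  # ~0.18: samples where model AND label are wrong stay hidden
print(err(full, preds))       # exactly the true error rate
```

Correcting on both sides recovers the true error rate exactly, while the one-sided correction can only push the estimate down.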



Thanks Raymond,
No no. I just said you can find the lecture name in the topic title. I also sometimes miss topic names!
Clearly explained. I have a silly question. Watching the lecture, I thought the Prof was talking about the bias concept. I think I understood your point, but it seems different from the bias aspect (I mean the one from the bias/variance concept).
I do not know now whether the two are the same or different. If they are the same, is there a possibility of underfitting, with bias increasing just because of correcting only the wrong examples?

Hello @S.hejazinezhad :smiley:,

In short, I think making a biased correction (only for the wrongly predicted samples) and then re-training a new model can result in a model that has more bias, or in one that has less bias.

Here is how I would think about this:

We have a truth, and we want to model that truth.

We have a dataset, and we want it to be a perfect representation of the truth, so that through it we can model the truth without bias.

Now, our dataset is never perfect, and it can be imperfect in 2 ways: (1) it has noise in the data, or (2) there is some systematic error in the data (for example, we almost always mislabel yellow cats as tigers, but only yellow cats).

For (1), examples can include random mislabeling, or random mismeasurement of feature values. Note that it is important that the errors are random. If you want a more concrete example, imagine the truth to be y = 2x + 0, and our (x, y) data look like this: (1, 2.1), (2, 3.9), (3, 5.9). This kind of imperfection makes us more vulnerable to overfitting (because to fit these data exactly, we would need a curved line instead of the true straight line). Therefore, correcting this imperfection will make us less vulnerable to overfitting.
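A tiny check of that example, in pure Python (assuming, as above, that the true relation is y = 2x):

```python
# Noisy observations of the truth y = 2x at x = 1, 2, 3
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 5.9]

# A straight line through all three points would need equal successive slopes;
# the second difference measures the curvature an EXACT fit would require.
curvature = (ys[2] - 2 * ys[1] + ys[0]) / 2.0
print(curvature)  # ~0.1: nonzero, so a line passing through every point must curve

# An ordinary least-squares line, by contrast, stays close to the true slope of 2
mx, my = sum(xs) / 3, sum(ys) / 3
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
print(slope)  # ~1.9
```

Chasing the noisy points exactly is overfitting; averaging over them keeps us near the truth.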

For (2), examples can include a malfunctioning measurement. For example, if the truth is again y = 2x + 0, but the measurement of y is always offset by 1, so that the data look like (1, 3), (2, 5), (3, 7), then you are going to end up with this model: y = 2x + 1, which is biased away from the truth because of the data. For this kind of imperfection, correcting it will reduce bias.
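And the offset example in code (a sketch; `fit_line` is just a hypothetical helper doing ordinary least squares):

```python
# Measurements of the truth y = 2x, but every y reading is offset by +1
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0 - the model y = 2x + 1, biased away from y = 2x
```

No amount of extra data with the same systematic offset fixes this; only correcting the measurement does.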

Now, back to your question. At the beginning, our mislabeling might be a random process, which only makes us more vulnerable to overfitting (because the model needs to be trickier to "call white black"). If the act of label correction itself is unbiased, it is very likely that we will end up with a dataset that is still randomly mislabeled but with less mislabeled data. In this case, I would say, we can improve a high-variance problem.

However, if our act of label correction is biased (because we choose to only correct wrongly predicted labels), then we are converting a random mislabeling problem into a systematic mislabeling problem. Therefore, we are going to end up with higher bias.

We might not end up with higher bias if initially there is already a systematic mislabeling problem and our correction reduces it.