In the first video of DLS Course 3, Week 2, Andrew mentions that if 5 out of 100 sampled errors are dogs, then the maximum improvement we could achieve by fixing dog recognition is 5% * 10% (the overall error rate) = 0.5%.
Is there an implicit assumption that the distribution of the 100 examples sampled from the misclassified examples is the same as the distribution of all the errors? Because if dogs actually make up more than 5% of all the errors, maybe we could achieve more than a 0.5% improvement. How should I understand this?
Hi @NocturneJay ,
You are right that we could have bad luck when selecting the 100 misclassified dev set examples and end up with a distribution quite different from that of all the dev set errors. In that case, as you mention, we could potentially reduce the error by more (or less) than 0.5%.
This kind of difference in distributions is always going to be present when we sample. The only way to make sure we have the exact same distribution is to pick all the errors and evaluate them manually. But with very big datasets, that may mean thousands or tens of thousands of examples, which is impractical and not necessary.
If you randomly sample a portion of them, you get a “feeling” for where it is worth spending effort to improve accuracy efficiently and where it is not. It is perfectly possible that by sampling “only” 100 errors we get some deviation from the actual error distribution, but it will usually not change the overall conclusion.
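Just to make that concrete, here is a small simulation sketch with made-up numbers (assume 1,000 misclassified dev examples, of which 10% are dogs) showing how much the dog count in a random sample of 100 errors can fluctuate:

```python
import random

random.seed(0)
# Hypothetical pool of all dev set errors: 1,000 examples, 10% of them dogs.
all_errors = ["dog"] * 100 + ["other"] * 900

dog_counts = []
for _ in range(1000):                        # repeat the sampling experiment
    sample = random.sample(all_errors, 100)  # draw 100 errors at random
    dog_counts.append(sample.count("dog"))

print(min(dog_counts), max(dog_counts))      # spread of sampled dog counts
print(sum(dog_counts) / len(dog_counts))     # average stays close to the true 10
```

The sampled count moves around the true value, but it rarely moves so far that it would change which error categories look worth working on.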
To put it into an example:
Let's imagine the 5 misclassified dogs among the 100 sampled errors do not correctly represent the actual error distribution. Say that instead of 5 it should have been 10 (that would be a really different distribution, double the number of dogs among the errors). Even in that case, we still have 90 out of 100 errors that are something else. Unless we have a large number of classes (e.g. 10 or more), it is still more worthwhile to work on classes other than dogs.
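Tying this back to the calculation in the question, the ceiling on the improvement is just the overall error rate times the fraction of errors in that category; a quick sketch:

```python
# Ceiling on the absolute error reduction from fixing one error category:
#   ceiling = overall error rate * fraction of sampled errors in that category
error_rate = 0.10                       # 10% overall dev error, as in the lecture

for dogs_in_sample in (5, 10):          # 5 observed vs. a hypothetical 10
    ceiling = error_rate * dogs_in_sample / 100
    print(f"{dogs_in_sample} dog errors out of 100 -> at most {ceiling:.1%} improvement")
```

Even if the sample undercounted the dogs by a factor of two, the ceiling only moves from 0.5% to 1.0%, while the other categories still account for at least 9% of the error.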
There is good math available to calculate confidence intervals and other statistical quantities I am not an expert on, but I don't think we need that here. The idea is that with this simple method you can check which class will have a bigger impact on the accuracy, so that we are more efficient in improving the overall model.
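For anyone curious, here is one simple way such an interval could be computed (a normal-approximation confidence interval for the dog share of the errors; an illustrative sketch, not something from the course):

```python
import math

# Normal-approximation 95% confidence interval for the dog share of the errors,
# given 5 dogs in a random sample of 100 errors.
k, n = 5, 100
p_hat = k / n
se = math.sqrt(p_hat * (1 - p_hat) / n)
low, high = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"dog share: {p_hat:.2f}, 95% CI roughly {low:.2f} to {high:.2f}")
```

Even the upper end of that interval leaves more than 90% of the errors in other categories, so the conclusion does not change.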
I hope this makes sense to you.
Thank you @carloshvp.
I think the point is that we cannot afford to evaluate all of the wrongly predicted samples. Random sampling of the errors can thus, in just a few minutes, guide us in choosing which idea to implement.
It's worth trying.