In the lecture, we analyze a small dataset of 100 images. Let’s assume instead that we have an extremely large dataset. How do we go about examining such a large dataset?
Back to our example: if there are 10 million mislabeled images and 15% of them are dogs, looking at even 100,000 images might not lead us to the dog misclassification, even though it is significant in this case.
Could you please elaborate more on what exactly your query is?
I think you should watch and listen to the relevant lectures again. Prof Ng does address the problems of very large datasets. You basically have to take a statistically fair sample of the errors and then analyze the causes. The scale of the subset that you analyze has to be practical, meaning a few hundred to a few thousand. If you have performed the selection fairly from a statistical point of view, then you should be able to discern some trends even from a very small (relatively speaking) subset. And of course you are starting by only sampling from the incorrect predictions.
In your example, if the mislabeled dog images are 15% of the errors, then they should be close to 15% of any randomly selected subset, right? They’re either 15% of the errors or they’re not. Of course if you only select 100 total samples to analyze, then there is the chance that you’ll miss things that are < 1% of the total errors. But 1% is 1%, right? You either care about that or you don’t. If 1% is a big deal to you, then maybe 100 is too small a sample size. The point is that we’re talking about statistical behavior here, so you need to think about it in a statistical manner.
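To put rough numbers on that, here is a minimal sketch of the arithmetic. It assumes the sample is drawn uniformly at random from a very large pool of errors, so a simple binomial approximation applies. The function names are made up for illustration and are not from the course.

```python
import math


def prob_category_missed(share: float, sample_size: int) -> float:
    """Chance that a uniform random sample contains zero items from a
    category making up `share` of all errors (binomial approximation)."""
    return (1.0 - share) ** sample_size


def standard_error(share: float, sample_size: int) -> float:
    """Approximate standard error of the category share estimated
    from a random sample of `sample_size` errors."""
    return math.sqrt(share * (1.0 - share) / sample_size)


# Compare a 15% category (the dog example) with a 1% category
# at two hypothetical sample sizes.
for n in (100, 1000):
    for share in (0.15, 0.01):
        print(
            f"n={n:5d} share={share:.2f} "
            f"P(missed)={prob_category_missed(share, n):.2e} "
            f"std_err={standard_error(share, n):.4f}"
        )
```

With 100 samples, a category that is 15% of the errors is essentially never missed entirely, and the estimate of its share is typically off by a few percentage points. A category that is 1% of the errors has roughly a 37% chance of not showing up at all, which is the case where a larger sample would be needed.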
You can also double check that your methodology is correct by doing several rounds of “random shuffle + select subset of 100” and seeing whether the results differ by more than the normal variation you would expect between random samples. If so, then maybe your random sampling is biased in some way.
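That check could look something like the sketch below. The error categories and their counts are invented purely for illustration, and the list is deliberately stored in sorted order to mimic a dataset with systematic ordering. `random.Random.sample` draws uniformly without replacement, which is equivalent to shuffling and taking the first 100.

```python
import random
from collections import Counter


def sample_category_shares(error_labels, subset_size=100, repeats=5, seed=0):
    """Draw several independent random subsets of the errors and return
    the share of each category in every subset."""
    rng = random.Random(seed)  # fixed seed so the check is reproducible
    results = []
    for _ in range(repeats):
        # Uniform sampling without replacement; the order of
        # error_labels has no influence on which items are picked.
        subset = rng.sample(error_labels, subset_size)
        counts = Counter(subset)
        results.append({cat: c / subset_size for cat, c in counts.items()})
    return results


# Hypothetical pool of misclassified examples, grouped by category
# (i.e. systematically ordered, not shuffled).
error_labels = ["dog"] * 15_000 + ["blurry"] * 25_000 + ["other"] * 60_000

shares = sample_category_shares(error_labels)
for i, s in enumerate(shares, start=1):
    print(f"run {i}: " + ", ".join(f"{k}={v:.2f}" for k, v in sorted(s.items())))
```

If the dog share stays in the same neighborhood across runs, the sampling is probably fine. If it swings wildly, or differs sharply from what you get by just taking the first 100 rows, that suggests the selection is being affected by the ordering of the data.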
Thanks Paul! That answers my question. I was worried about a potential systematic ordering of the data and I was wondering if there is any recommended statistical approach for sampling.
Shuffling and choosing random samples answers my question.
I’m glad the replies were useful, but I’ll bet you that Prof Ng actually said his own version of what I just said in the lectures. Might be worth another look!