Carrying out Error Analysis

Moutasem_Akkad · April 26, 2022, 4:34pm

Hi,

In the lecture, we are analyzing a smaller dataset with 100 images. Let’s assume that we have an extremely large dataset. How do we go about examining such a large dataset?

Back to our example: if there are 10 million images that are mislabeled, and 15% of them are dogs, looking at even 100,000 images might not lead us to the dog misclassification, even though it is significant in this case.

MayankGhogale · April 26, 2022, 5:22pm

Hello sir,
Could you please elaborate more on what exactly your query is?
Thank you
Regards,
Mayank Ghogale

paulinpaloalto · April 26, 2022, 6:40pm

I think you should watch and listen to the relevant lectures again. Prof Ng does address the problems of very large datasets. You basically have to take a statistically fair sample of the errors and then analyze the causes. The scale of the subset that you analyze has to be practical, meaning a few hundred to a few thousand. If you have performed the selection fairly from a statistical point of view, then you should be able to discern some trends even from a very small (relatively speaking) subset. And of course you are starting by only sampling from the incorrect predictions.
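To make the "sample only from the incorrect predictions" step concrete, here is a minimal sketch. The function name `sample_errors` and the plain-list inputs are my own assumptions for illustration, not anything from the course materials.

```python
import random


def sample_errors(labels, predictions, k, seed=0):
    """Return up to k randomly chosen indices where prediction != label.

    Hypothetical helper: `labels` and `predictions` are assumed to be
    equal-length sequences of class ids for the dev set.
    """
    # Keep only the examples the model got wrong.
    error_indices = [
        i for i, (y, y_hat) in enumerate(zip(labels, predictions)) if y != y_hat
    ]
    # Uniform sampling without replacement, so any ordering of the
    # original data does not influence which errors get picked.
    rng = random.Random(seed)
    k = min(k, len(error_indices))
    return rng.sample(error_indices, k)
```

You would then look at the examples at the returned indices by hand and tally the causes.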

paulinpaloalto · April 26, 2022, 6:46pm

In your example, if the mislabeled dog images are 15% of the errors, then they should be close to 15% of any randomly selected subset, right? They’re either 15% of the errors or they’re not. Of course if you only select 100 total samples to analyze, then there is the chance that you’ll miss things that are < 1% of the total errors. But 1% is 1%, right? You either care about that or you don’t. If 1% is a big deal to you, then maybe 100 is too small a sample size. The point is that we’re talking about statistical behavior here, so you need to think about it in a statistical manner.
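A rough back-of-the-envelope check of those numbers, as a sketch only. It assumes each sampled error is an independent draw from a very large pool (reasonable when the sample is tiny compared with the total number of errors), and the function names are mine.

```python
def prob_category_missed(p, n):
    """Chance that a category making up fraction p of the errors
    does not appear even once in a random sample of n errors."""
    return (1.0 - p) ** n


def proportion_standard_error(p, n):
    """Approximate standard error of the observed fraction of a
    category in a random sample of n errors (binomial assumption)."""
    return (p * (1.0 - p) / n) ** 0.5


if __name__ == "__main__":
    # A 15% category is essentially never missed in 100 samples.
    print(prob_category_missed(0.15, 100))
    # A 1% category is missed in roughly a third of such samples.
    print(prob_category_missed(0.01, 100))
    # The observed dog share in 100 samples is 15% give or take a few points.
    print(proportion_standard_error(0.15, 100))
```

Under these assumptions, a 1% category is missed entirely about 37% of the time with 100 samples, while the 15% dog category shows up at roughly 15% plus or minus 3.6 percentage points.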

You can also double check that your methodology is correct by doing several rounds of “random shuffle + select subset of 100” and seeing whether the results differ. If they differ substantially, then maybe your random sampling is biased in some way.
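Here is a small simulation of that check, as a sketch only. In real error analysis you do not know the cause of each error until you look at it, so the pre-tagged list below is synthetic and exists only to show how much the share should wobble between fair draws. The function name and parameters are hypothetical.

```python
import random


def category_share_per_draw(error_tags, category, subset_size=100, repeats=5, seed=0):
    """Draw `repeats` random subsets and return the share of `category` in each.

    `error_tags` is an assumed list with one cause tag per misclassified
    example (synthetic here; in practice you would tag each subset by hand).
    """
    rng = random.Random(seed)
    shares = []
    for _ in range(repeats):
        # rng.sample is equivalent to shuffling and taking the first subset_size items.
        subset = rng.sample(error_tags, subset_size)
        shares.append(sum(tag == category for tag in subset) / subset_size)
    return shares


if __name__ == "__main__":
    # Synthetic pool of errors in which 15% are dog images.
    tags = ["dog"] * 1500 + ["other"] * 8500
    print(category_share_per_draw(tags, "dog"))
```

With a fair sampler the shares should all land in the neighborhood of 0.15. If your real subsets disagree far more than this simulation does, that is a hint the selection is not actually random.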

Moutasem_Akkad · April 26, 2022, 7:01pm

Thanks Paul! That answers my question. I was worried about a potential systematic ordering of the data and I was wondering if there is any recommended statistical approach for sampling.

Shuffling and choosing random samples answers my question.

paulinpaloalto · April 26, 2022, 7:25pm

I’m glad the replies were useful, but I’ll bet you that Prof Ng actually said his own version of what I just said in the lectures. Might be worth another look!
