In the course we have seen that in Error Analysis we study the examples on which errors occurred, try to group them into categories, and then add more data of the category in which errors occur the most (the category that covers the largest share of the errors). So the point is that we should add specific data rather than general data to improve the model’s performance.
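For instance, I understand the tallying step to work roughly like this (the categories and counts below are made up for illustration):

```python
from collections import Counter

# Hypothetical error analysis: each misclassified CV example has been
# hand-labelled with an error category (made-up data for illustration).
error_categories = ["pharma", "phishing", "pharma", "unusual routing",
                    "pharma", "deliberate misspellings", "pharma", "phishing"]

for category, count in Counter(error_categories).most_common():
    print(f"{category}: {count} errors")
# "pharma" dominates, so collecting more pharma examples looks like
# the most promising next step.
```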
But I have one doubt. Adding more specific data will surely improve the model’s performance on the data where the errors occurred (for example, adding ‘pharma’ data to a mail spam classifier when many errors occur on pharma emails), but won’t this affect the part of the data that is already handled correctly? (The non-pharma data in the mail spam classifier that the model is predicting correctly: won’t the model become biased towards pharma data and lose performance there?) So the question is: we add specific data to improve the parts of the data where we get errors, but won’t that data affect the part that was always correct? If yes, then why; if no, then why? And will the effect always be positive? If yes, then why; if no, then why?
I won’t say that it is a law of nature that adding data for the worst-performing class is definitely only going to have a positive effect with zero drawbacks. I don’t think so either.
This is an iterative process. This means that, after our first analysis and after carrying out all those actionables, we will need to do another analysis to figure out what to do next. If our actionables were always going to have only positive effects and no drawbacks, wouldn’t it be too good to be true? I think that’s why you have asked these questions.
Therefore, I think it is good that the lecture suggests some possible actionables so that we can start making some changes, rather than offering no suggestion at all, isn’t it? However, the thinking part and the decision-making part are left to us.
For example (and again, this is not guaranteed to be true in every single ML case in every round of iteration of the development cycle): if adding photos for class A makes class B worse, even though class B had performed well before, could that be a sign that our model is underfitting, so that expanding it a bit can allocate room for good features of both class A and class B? Speculations like this require thinking, experimentation, knowledge, and experience.
I think the lecture shared some knowledge and experience, but the thinking and experimentation parts have to come from us.
Hi Raymond, I still have a question about this example mentioned in error analysis. Since Andrew talked about high variance and high bias in previous lectures, I feel that 100/500 misclassified on the CV set (21 of them pharma) could be a case of high variance, because the model is not doing well on generalization, although I have no idea what the human-level performance to compare with is. As Andrew said previously, to solve a high-variance problem we need to either get more training examples or reduce the model complexity. However, Andrew’s solution to the problem in this case is “more data or new features”, and the result is about the CV set instead of the training set. Doesn’t “new features” violate the rule of “reducing model complexity” for dealing with high variance? I think this is the best ML course ever made; I really hope you can answer my question.
To begin with, it is always important to make that high-variance judgement by looking at both the training set and the CV set performance. It is high variance when the model does much better on the training set than on the CV set. We don’t see the training set performance here.
I believe that humans can do better than that. However, your statement actually supports the idea of high bias, which is when both the training set and the CV set performance are unsatisfactory. Right?
Now, we are not sure whether it had a high-variance problem, so I am not going to comment in that direction. However, we do know that the CV set’s performance on pharma emails was poorer than human level, so it is likely that we were suffering from an underfitting problem.
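As a rough sketch of the kind of comparison I mean (the baseline and training errors below are made up; only the CV error comes from the lecture’s 100/500 figure, and the thresholds are just illustrative):

```python
# Rough sketch of the train-vs-CV comparison (numbers are illustrative).
baseline_error = 0.10  # e.g. estimated human-level error (assumed)
train_error    = 0.18  # training set error (not given in the lecture)
cv_error       = 0.20  # 100/500 misclassified on the CV set

# High bias: training error is well above the baseline.
high_bias = (train_error - baseline_error) > 0.02
# High variance: CV error is well above the training error.
high_variance = (cv_error - train_error) > 0.02

print(f"high bias: {high_bias}, high variance: {high_variance}")
# With these made-up numbers, the gap to human level dominates,
# which points to underfitting rather than overfitting.
```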
Andrew’s original words are:

“you may decide to collect more data but not more data of everything, but just try to find more data of pharmaceutical spam emails so that the learning algorithm can do a better job recognizing these pharmaceutical spam. Or you may decide to come up with some new features that are related to, say, specific names of drugs or specific names of pharmaceutical products that the spammers are trying to sell, in order to help your learning algorithm become better at recognizing this type of pharma spam”
He specifically said to collect pharma-related samples or features, NOT just any samples or features.
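To illustrate what a pharma-specific feature could look like (the keyword list and function name below are my own invention for illustration, not from the lecture):

```python
# Illustrative pharma-specific feature (keyword list is made up).
PHARMA_KEYWORDS = {"viagra", "cialis", "pharmacy", "prescription", "pills"}

def pharma_keyword_count(email_text: str) -> int:
    """Count how many pharma-related keywords appear in an email."""
    words = (w.strip(".,!?") for w in email_text.lower().split())
    return sum(w in PHARMA_KEYWORDS for w in words)

print(pharma_keyword_count("Cheap pills from our online pharmacy!"))  # prints 2
```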
It makes sense to add new features to handle an underfitting case, because new features mean a bigger network (the first hidden layer carries more weights if the input has more features), and with a bigger network, we certainly want more samples. On the other hand, collecting more pharma samples can rebalance the NN towards pharma, which is a good thing if pharma was originally under-represented.
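A quick way to see the “bigger network” point with a toy Keras model (the layer sizes are arbitrary, and this is just a sketch, not code from the course):

```python
import tensorflow as tf

def build_model(num_features: int) -> tf.keras.Model:
    # Toy spam classifier; the layer sizes are arbitrary.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(num_features,)),
        tf.keras.layers.Dense(25, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

# More input features -> a bigger weight matrix in the first Dense layer,
# hence more parameters to fit.
print(build_model(100).count_params())  # (100 + 1) * 25 + (25 + 1) = 2551
print(build_model(120).count_params())  # (120 + 1) * 25 + (25 + 1) = 3051
```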