Adding more data after error analysis

How do we know that adding more data for a category that performed poorly in error analysis will increase the model’s accuracy?

Does that mean we don’t need to check whether the model has high bias or high variance?

Hello Seungjun @Seungjun_Lee,

We definitely want to check whether the model has high bias, high variance, or both. That is why, when we say a model performs poorly on a certain category, we can also say whether it is poor on both the training and cv sets (high bias), or noticeably poorer on the cv set than on the training set (high variance).
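To make that check concrete, here is a minimal sketch of a per-category diagnosis. The function name `diagnose`, the `gap` threshold of 0.05, the `baseline_error` parameter, and the error numbers are all assumptions made up for illustration, not values from the course; pick thresholds that make sense for your own problem.

```python
def diagnose(train_error, cv_error, baseline_error=0.0, gap=0.05):
    """Label one category from its training and cv error rates.

    baseline_error and gap are illustrative assumptions, not fixed rules.
    """
    labels = []
    # Poor even on the training set: the model underfits this category.
    if train_error - baseline_error > gap:
        labels.append("high bias")
    # Much poorer on cv than on training: the model does not generalize.
    if cv_error - train_error > gap:
        labels.append("high variance")
    return labels or ["ok"]


# Hypothetical (training error, cv error) pairs per category.
errors = {
    "cat": (0.30, 0.32),
    "dog": (0.02, 0.20),
    "bird": (0.03, 0.04),
}

report = {category: diagnose(tr, cv) for category, (tr, cv) in errors.items()}
for category, labels in report.items():
    print(category, labels)
```

With these made-up numbers, "cat" comes out as high bias and "dog" as high variance, so the two categories would call for different follow-ups.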

When it has high bias, chances are that adding more data for a certain category can help that category but not the overall performance, because your neural network may not have enough freedom (or enough neurons) to express all the crucial features for distinguishing samples of one category from samples of another.

When it has high variance, adding more data can help your neural network become less sensitive to noise, that is, to features that are neither common among your samples nor useful.

Therefore, adding data for a certain category won’t be harmful to that category. But to maximize the benefit of those extra samples, we need to know whether our model is underfitting, because in that case we would also want a bigger neural network to accommodate all the useful features.

Lastly, doing the high bias and high variance check is one thing, but examining the data in that poorly performing category is another. The better we understand the difference between our training samples and the real-world samples, the more likely we are to introduce genuinely useful samples to the training set. For example, if it is an image recognition model that performs poorly on cat images, and in our analysis we find that our training samples never show the side view of a cat but the real-world samples do, then we know we need more side views of cats.
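If your samples carry some kind of tag (here a hypothetical "view" label that you would have to annotate yourself), a rough sketch of that comparison could look like the following. The helper `coverage_gaps`, the `min_ratio` cutoff, and the tag lists are assumptions for illustration only.

```python
from collections import Counter


def coverage_gaps(train_tags, real_tags, min_ratio=0.5):
    """Return tags whose share in training is well below their real-world share.

    min_ratio is an illustrative cutoff: a tag is flagged when its training
    share is less than min_ratio times its real-world share.
    """
    train_freq = Counter(train_tags)
    real_freq = Counter(real_tags)
    n_train, n_real = len(train_tags), len(real_tags)
    gaps = {}
    for tag, count in real_freq.items():
        real_share = count / n_real
        train_share = train_freq[tag] / n_train if n_train else 0.0
        if train_share < min_ratio * real_share:
            gaps[tag] = (train_share, real_share)
    return gaps


# Hypothetical view tags for cat images.
train_views = ["front"] * 8 + ["back"] * 2
real_views = ["front"] * 5 + ["side"] * 4 + ["back"] * 1

gaps = coverage_gaps(train_views, real_views)
print(gaps)
```

In this toy example only "side" is flagged, which matches the cat scenario above: that is the kind of sample worth collecting more of.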

Cheers,
Raymond
