Bird Recognition in the City of Peacetopia: Citizen Data

Hi,
Regarding the following question, I don’t fully understand why we should add the citizens’ data to the training set. I only see it doing more harm than good. The pictures taken by citizens may be very different from those taken by the security cameras. Adding them would improve predictions if we also wanted to predict on pictures taken by citizens, which is not the case here. Also, the only way I see this improving the system is if we add some percentage of this data to each of the training, dev and test sets, not to just one of them.

The explanation says that:

  1. “Sometimes we’ll need to train the model on the data that is available, and its distribution may not be the same as the data that will occur in production.” It is true that sometimes we don’t have much choice, but here we do have the choice of whether or not to include the citizens’ data.

  2. “Also, adding training data that differs from the dev set may still help the model improve performance on the dev set.” How will it help? Can you tell me some cases where it will help improve performance on the dev set?


Hey @storm95,

The main intuition here is that adding a considerable number of new training examples, even from a different distribution, may still help learning. We also want our dev and test sets to be as close as possible to the true data distribution, because we use these sets to evaluate our model, and we want to evaluate on the kind of examples that will appear at inference time (e.g. pictures taken by security cameras).
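To make the split concrete, here is a minimal sketch in plain Python. The function name `build_splits`, the 10% dev and test fractions, and the image counts are all hypothetical and chosen only for illustration; they are not taken from the quiz. The citizens' images go into the training set only, while the dev and test sets are drawn purely from security-camera images.

```python
import random


def build_splits(camera, citizen, dev_frac=0.1, test_frac=0.1, seed=0):
    """Split camera data into train/dev/test and add citizen data to train only.

    Hypothetical helper: names and fractions are illustrative assumptions.
    """
    rng = random.Random(seed)
    camera = list(camera)
    rng.shuffle(camera)

    n_dev = int(len(camera) * dev_frac)
    n_test = int(len(camera) * test_frac)

    # Dev and test contain only security-camera images, so they reflect
    # the distribution the model will see at inference time.
    dev = camera[:n_dev]
    test = camera[n_dev:n_dev + n_test]

    # The remaining camera images plus ALL citizen images form the training set.
    train = camera[n_dev + n_test:] + list(citizen)
    rng.shuffle(train)
    return train, dev, test


# Made-up stand-ins for real images: (source, index) pairs.
camera_images = [("camera", i) for i in range(1000)]
citizen_images = [("citizen", i) for i in range(200)]

train, dev, test = build_splits(camera_images, citizen_images)
n_citizen_in_train = sum(1 for source, _ in train if source == "citizen")
```

With this layout, the dev-set metric still measures performance on security-camera pictures, so you can check directly whether the extra citizen data helped or hurt by training with and without it and comparing the two dev scores.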

Please remove the screenshot with your quiz answer and the notes on the explanation for the wrong answer; posting them is against the rules.

Thanks, @manifest. I can’t find an edit option that would let me delete the screenshot. Can you tell me how to remove it?