1,000 new images after model development (train/dev/test): where to add them?

Here is a rephrased question:

After developing the model, I found out that it is doing worse in the real world because of a new bird species. What should I do first if I have 1,000 new images of that bird?

Do I add them to the training set after augmenting them?

Do I split them and add them to the dev/test sets?

The problem is that the lecture told me that if the model does worse in the real world, I should change the cost function/metric/dev/test sets.

But in the quiz, it was suggested that I augment the data and add it to the training set.
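
For concreteness, by “augmenting” I mean standard image augmentation, e.g. a torchvision sketch like this (the specific transforms are only my guess, not from the quiz):

```python
import torchvision.transforms as T

# A typical augmentation pipeline for the 1,000 new bird images;
# each training epoch then sees a randomly perturbed variant of every image.
augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random crop, resize to 224x224
    T.RandomHorizontalFlip(),                     # mirror left/right
    T.RandomRotation(15),                         # rotate up to ±15 degrees
    T.ColorJitter(brightness=0.2, contrast=0.2),  # vary lighting
    T.ToTensor(),
])
```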

What am I missing here?

You need to include the new bird species in the training somehow. “Starting over” is the most robust method.

Personally, I’d throw all of the data back into one bucket (the original data along with your new data), and start over with new training, validation, and test sets.

You might have to use a less robust method if the data set is so large that re-training is problematic.
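
A minimal sketch of that “one bucket” approach, assuming the images are tracked as file paths with integer labels (all counts and names here are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical placeholders: in practice these are your real image paths/labels.
old_paths = np.array([f"old_{i}.jpg" for i in range(9000)])
old_labels = np.random.randint(0, 5, size=9000)     # 5 previously known species
new_paths = np.array([f"new_{i}.jpg" for i in range(1000)])
new_labels = np.full(1000, 5)                       # the new, 6th species

# One bucket: the original data along with the new data.
paths = np.concatenate([old_paths, new_paths])
labels = np.concatenate([old_labels, new_labels])

# Start over with fresh splits. Stratifying on the label keeps the new
# species represented in training, validation, and test alike.
train_p, rest_p, train_y, rest_y = train_test_split(
    paths, labels, test_size=0.2, stratify=labels, random_state=0)
val_p, test_p, val_y, test_y = train_test_split(
    rest_p, rest_y, test_size=0.5, stratify=rest_y, random_state=0)
```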

Your response implies that the subtle difference is the arrival of a “new species”. If that weren’t the case (no new species, just 1,000 new images, and the model is doing worse in the real world), could I have changed the dev/test set for further improvement?

In addition to Tom’s reply:
I would also recommend understanding why your model is performing worse on the new bird species. I believe this can be a crucial point, and it could also potentially help with other new species you might want to address later on. E.g. you should ask yourself (see the sketch after this list):

  • Are there significant differences in the data, caused either by the new bird species or perhaps by other circumstances in data acquisition?
  • Could you do something about it, like data preprocessing, to prepare the data well so that your model can generalise better?
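
As a starting point for that analysis, a per-class report on the new images quickly shows whether errors concentrate on the new species or hint at a broader data-acquisition issue. A sketch (the labels and predictions are random stand-ins; in practice they come from running your current model):

```python
import numpy as np
from sklearn.metrics import classification_report

# Random stand-ins; replace with your model's outputs on the new images
# plus a sample of the old data, so per-species errors can be compared.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 6, size=500)   # species id 5 = the new bird
y_pred = rng.integers(0, 6, size=500)   # placeholder for model predictions

# Per-class precision/recall/F1 reveals where the errors concentrate.
print(classification_report(y_true, y_pred))
```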

Best regards
Christian

That’s not a subtle difference. That’s a classification your model was not trained to detect. It’s a significant change in what you expect the model to do.

@TMosh @Christian_Simonis

Allow me to explain what I understood so far:

  1. If a model is not doing well in the real world (even though there are no “new species”), Andrew suggested using a new metric, changing the dev/test set, and/or the cost function. However, adding this new data to training is not an issue.

  2. The model is not doing well in the real world because of a “new species”. From your replies, I got that I should throw all the data into one bucket (if re-training is not problematic) and “start over”, because the model is not generalizing well.

  3. The model is not doing well in the real world because of a “new species”. I can also try data augmentation on the existing training/dev/test sets (without adding the “new species”) to see whether the model can generalize better to the “new species” in the real world.

Here is another option (rephrased) that I got from the quiz:

  1. Define a new metric (using a new dev/test set) that takes the new species into account. This completely avoids adding the new species data to training.
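
A sketch of what such a metric could look like, along the lines of the weighted-error idea from the lecture (the weight value is a hypothetical choice you would tune to the business need):

```python
import numpy as np

def weighted_dev_error(y_true, y_pred, is_new_species, w_new=10.0):
    """Misclassification error that penalises mistakes on the new species
    w_new times more heavily than mistakes on everything else."""
    w = np.where(np.asarray(is_new_species), w_new, 1.0)
    mistakes = (np.asarray(y_true) != np.asarray(y_pred)).astype(float)
    return float(np.sum(w * mistakes) / np.sum(w))

# Example: one mistake on a new-species image now dominates the metric.
print(weighted_dev_error([5, 1, 2], [0, 1, 2], [True, False, False]))  # 0.833...
```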

I have some additional confusion.

From the lecture we see that the dev set and test set should come from the same contribution to keep the target aligned. That makes sense.

But isn’t the train set what teaches the model? If we only throw data into the dev/test sets, not the train set, how can the model learn from those new cases?

The training, dev, and test sets should all share the same statistical distribution - not contribution.

Thanks for the reply. Sorry that was a typo. I meant distribution though.

My confusion comes from two places:

  • The quiz says that “adding training data that differs from the dev set may still help the model improve performance on the dev set. What matters is that the dev and test set have the same distribution”.
  • Andrew also emphasizes in the lecture that dev and test should have the same distribution.

  1. Do these mean the dev/test sets might have a different distribution from the train set? (See the sketch below.)
  2. In the statement “adding training data that differs from the dev set may still help the model improve performance on the dev set”, is the added training data assumed to share the same distribution as the dev set?
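
For question 1, here is a sketch of the kind of split the quiz statement allows (all counts hypothetical): the dev and test sets come only from the distribution you care about, while the training set may differ:

```python
import random

# Hypothetical: 20,000 web-scraped images (a different distribution) and
# 2,000 production images (the distribution the model will face).
web = [f"web_{i}.jpg" for i in range(20000)]
prod = [f"prod_{i}.jpg" for i in range(2000)]
random.Random(0).shuffle(prod)

# Dev and test both come from the production distribution only, so they
# share the same distribution and define a single clear target...
dev, test = prod[:1000], prod[1000:]

# ...while the training set is allowed to differ: here it is all web data.
train = web
```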

I don’t have any good answer as to why the lecture says what it does.

This could be the case, e.g. if you want to deploy your model for object identification, like identifying a cat. Then this should manifest in your test set, too: the test set could be tailored to the business problem we want to solve - in our example: cats! It’s important to use new data (e.g. pictures of cats) that the model never saw before. Still, the training data could also include other training examples (of cats, but also of other related animals).

My take: not necessarily! Adding new training data could help the model to learn abstract patterns better (like paws, which are definitely relevant to our cat example but to many other animals as well). There are many relevant data and characteristics to be learned from other pictures with animals like tigers, lions, leopards, … :leopard:, that could help the model learn the features needed to identify a cat more accurately. So, the model could learn how edges and contours make a “paw” or “whiskers” or other features that are important for identifying a cat; low-level features like edges are hierarchically combined and enhanced to describe more advanced patterns, which finally form objects and contribute to the classification of whether we see a cat in the picture or not. This thread might be worth a look: What makes the different neurons in a layer calculate different parameters? - #7 by Christian_Simonis
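
That hierarchy is also the intuition behind transfer learning: low- and mid-level features (edges, contours, fur textures) learned on many animals can be reused, and only the top of the network is retrained. A sketch with torchvision (the six-class head is a hypothetical choice for the bird example):

```python
import torch.nn as nn
from torchvision import models

# Reuse generic visual features learned on ImageNet...
backbone = models.resnet18(weights="IMAGENET1K_V1")
for p in backbone.parameters():
    p.requires_grad = False                # freeze the feature hierarchy

# ...and retrain only a fresh classification head for our own classes.
backbone.fc = nn.Linear(backbone.fc.in_features, 6)
```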

Hope that helps!

Best regards
Christian

In addition to your second question: it also depends, of course, on how much high-quality data you already have in your train set (with relevance to the business problem). In general, one could say that the marginal benefit of adding new training data decreases (or flattens out) the more high-quality data you already have in your train set.

On the other hand: if you only have a small number of cat pictures in your train set, you will probably benefit very much from adding new training examples with cats (and potentially also related animals, see the previous post). One way to make this concrete is to plot a learning curve, as in the sketch below.
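
A sketch of such a learning curve (the error values are made up to illustrate the flattening, not measured; in practice each point is a separate training run):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical dev errors after training on growing subsets of the data.
train_sizes = np.array([1000, 2000, 4000, 8000, 16000])
dev_error = np.array([0.25, 0.18, 0.14, 0.12, 0.11])   # flattens out

plt.semilogx(train_sizes, dev_error, marker="o")
plt.xlabel("training set size")
plt.ylabel("dev error")
plt.title("Diminishing returns from additional training data")
plt.show()
```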

Hope that helps!

Best regards
Christian

@TMosh @Christian_Simonis

Thanks for all the detailed explanations.

Here’s what I understood so far according to the lecture, the quiz, and this thread (I’m presenting it in @plutonic18’s way since I found it a clear format)

  1. For Dev/Test sets:

    • New data are welcome only if they are close to the real world or production.
      Reason: other “non-real-world” or “non-production” data introduce noise.
    • If new data are added, make sure they’re added to both sets.
      Reason: the dev/test sets should come from the same statistical distribution to keep our target consistent.
  2. For Training set:

    • New data are also welcome in the training set, even if they come from a different distribution than the dev/test sets, i.e. they are “non-real-world” or “non-production”.
      Reason: as long as those data are “valid”, the model can learn more features at different levels from them.
    • If the new data set is too small compared to the training set, it’s still not bad to put it into the training set, but it won’t help much. Instead, using it to adjust the dev/test sets and the evaluation will have a bigger impact (see the sketch below).
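
A sketch of that last point, with all counts hypothetical: 1,000 new images barely move a 100,000-image training set, but they can reshape the dev/test target:

```python
# Hypothetical counts: 100,000 existing training images vs. 1,000 new ones.
new_images = [f"new_{i}.jpg" for i in range(1000)]

# Diluted into 100k training images, 1,000 extras change ~1% of what the
# model sees; placed in dev/test instead, they redefine the target that
# every tuning decision is measured against.
dev_extra, test_extra = new_images[:500], new_images[500:]
```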