Training strategy as more real data becomes available

Hi!

I was wondering what the typical approach is as more real data from the field becomes available.

Let’s say the model performs well enough after training on the initial training data to be deployed in the field.
Now the real world is generating more data, and we are busy checking its labels manually.
How do you integrate this data into the network?

Would I just add it to the training set and keep training from the current best model? My intuition says that new data will lead to slightly different gradients and therefore to the potential to improve the model over time.

–Flo

Hi @whnr,
I think you raise really good points for discussion. It is important to check whether the most recent data differs from the data originally used for training; when it does, this is called data drift. If you can establish that the new data differs from the original, you should either train on the new data only (if it is radically different) or add the new data to the original dataset and retrain on the combined set.
I hope this makes sense.
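One rough way to check for drift on a single numeric feature is to compare the empirical distributions of the training data and the field data. The sketch below computes a two-sample Kolmogorov-Smirnov statistic in plain Python. The function names, the synthetic data, and the 0.2 threshold are all assumptions made for illustration, not something from the course; in practice you would probably use a library routine such as `scipy.stats.ks_2samp`, which also gives a p-value.

```python
import bisect
import random


def ks_statistic(reference, recent):
    """Largest gap between the empirical CDFs of two samples (two-sample KS statistic)."""
    ref = sorted(reference)
    rec = sorted(recent)
    gap = 0.0
    # The largest gap is always reached at one of the observed values.
    for x in ref + rec:
        cdf_ref = bisect.bisect_right(ref, x) / len(ref)
        cdf_rec = bisect.bisect_right(rec, x) / len(rec)
        gap = max(gap, abs(cdf_ref - cdf_rec))
    return gap


def looks_drifted(reference, recent, threshold=0.2):
    """Hypothetical helper: flag drift when the KS statistic exceeds an arbitrary threshold."""
    return ks_statistic(reference, recent) > threshold


# Synthetic stand-ins for one feature of the training set and of two field batches.
rng = random.Random(0)
train_feature = [rng.gauss(0.0, 1.0) for _ in range(500)]
field_same = [rng.gauss(0.0, 1.0) for _ in range(500)]      # same distribution
field_shifted = [rng.gauss(1.5, 1.0) for _ in range(500)]   # mean has moved

print("same distribution:", round(ks_statistic(train_feature, field_same), 3))
print("shifted mean:     ", round(ks_statistic(train_feature, field_shifted), 3))
```

This only looks at one feature at a time and says nothing about whether the labels have changed meaning, so treat it as a first screening step rather than a full drift test.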

Yes! Thank you @carloshvp,

Course 4 / Week 2 had some more examples and the programming assignment about transfer learning (Alpaca recognition). That gave me a good intuition about what is needed or how this could work.

I'll just add a little to what Carlos said.

If there isn’t any significant concept drift in the new data, you should do well by simply continuing to train your existing model on that new data. The pretraining usually helps learning and reduces variance.

In this case, we prefer not to use the original data in our training process because repeating examples introduces bias.
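To make the two options from this thread concrete, here is a toy sketch that continues training from the deployed weights, once on the new data only and once on the original and new data combined. It uses a one-dimensional linear model trained with plain SGD instead of a neural network, and every name, the synthetic data, and the lower fine-tuning learning rate are assumptions for illustration only. With a real network the same pattern would be to load the saved checkpoint and keep training.

```python
import random


def sgd_fit(data, w=0.0, b=0.0, lr=0.05, epochs=50, seed=0):
    """Fit y ~ w*x + b by SGD on squared error, starting from the given (w, b)."""
    rng = random.Random(seed)
    data = list(data)  # copy so shuffling does not touch the caller's list
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b


def mse(data, w, b):
    """Mean squared error of the linear model on a dataset."""
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)


def make_data(rng, n, slope, intercept):
    """Synthetic (x, y) pairs from a noisy line; purely illustrative."""
    points = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        points.append((x, slope * x + intercept + rng.gauss(0.0, 0.05)))
    return points


rng = random.Random(1)
initial_data = make_data(rng, 200, slope=2.0, intercept=0.5)  # original training set
field_data = make_data(rng, 200, slope=2.2, intercept=0.4)    # mildly shifted field data

# "Deployed" model: trained from scratch on the initial data.
w0, b0 = sgd_fit(initial_data)

# Option A: continue from the deployed weights on the new data only.
w_new, b_new = sgd_fit(field_data, w=w0, b=b0, lr=0.01, epochs=5)

# Option B: continue from the deployed weights on original + new data.
w_all, b_all = sgd_fit(initial_data + field_data, w=w0, b=b0, lr=0.01, epochs=5)

print("field MSE, deployed model: ", mse(field_data, w0, b0))
print("field MSE, new data only:  ", mse(field_data, w_new, b_new))
print("field MSE, combined data:  ", mse(field_data, w_all, b_all))
print("initial MSE, new data only:", mse(initial_data, w_new, b_new))
```

Comparing the error on the field data and on the initial data for both options is a simple way to see how much each choice adapts to the new data and how much it gives up on the old.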
