Hello
Say we have a model trained on structured data, and it is already deployed in a production environment. At some point we decide to add new features that are only available for newly created data (for example, we start asking users for their date of birth). The question is: what is the best practice for handling this kind of change?
- Do we have to retrain the whole model from scratch, as we would with completely new data, or is there some way to just update the old model?
- What about the old data? How should we fill in the missing feature (the date of birth in the example) in a way that doesn't hurt model performance?
Thank you
There is no way to directly change an existing model to make use of a new feature. You can combine outputs of multiple models to make a prediction. For instance, assume that you were creating a model to approve a bank loan:
- Old model outputs prediction, P1
- New model outputs prediction, P2
- Combine P1 and P2 to produce the final outcome, say, P3. For the sake of simplicity, this could be the average of P1 and P2.
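As a rough illustration of the combination step, here is a minimal sketch. It assumes both models output a probability between 0 and 1; the function name `combine_predictions` and the `weight_new` parameter are made up for this example, and a weight of 0.5 gives the plain average mentioned above.

```python
def combine_predictions(p1: float, p2: float, weight_new: float = 0.5) -> float:
    """Blend the old model's prediction p1 with the new model's prediction p2.

    weight_new is a hypothetical tuning knob: 0.5 is a plain average,
    larger values trust the new model more.
    """
    if not 0.0 <= weight_new <= 1.0:
        raise ValueError("weight_new must be between 0 and 1")
    return (1.0 - weight_new) * p1 + weight_new * p2


# Example: the old model says 0.70, the new model says 0.90.
p3 = combine_predictions(0.70, 0.90)
print(round(p3, 2))
```

In practice the weight would be chosen on a validation set rather than fixed by hand.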
Claiming that the new features are good ones means you have built a model that outperforms the existing model in at least one respect: storage, compute, or predictive performance.
There are 2 more choices to make if your API supports only new data moving forward:
- Discard all old data and use only new data to create a fresh model that replaces the existing one. The drawback is that there may be far fewer new data points than there were in the existing data.
- Provide defaults for old data points for new features (say, use the most frequently occurring date of birth value) and build a new model with all data.
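The second option (filling in a default for old records) could look roughly like the sketch below. It assumes the records are plain dictionaries and that a missing value is either `None` or an absent key; the helper name `fill_with_mode` and the `birth_year` field are invented for the example.

```python
from collections import Counter


def fill_with_mode(rows, feature):
    """Return copies of rows where a missing `feature` is replaced by the
    most frequently observed value of that feature."""
    observed = [r[feature] for r in rows if r.get(feature) is not None]
    if not observed:
        raise ValueError(f"no observed values for {feature!r}")
    default = Counter(observed).most_common(1)[0][0]
    return [
        {**r, feature: r[feature] if r.get(feature) is not None else default}
        for r in rows
    ]


rows = [
    {"income": 40, "birth_year": None},   # old record, collected before the change
    {"income": 55, "birth_year": 1990},
    {"income": 61, "birth_year": 1990},
    {"income": 38, "birth_year": 1985},
]
filled = fill_with_mode(rows, "birth_year")
print(filled[0])  # the old record now carries the most common birth year
```

The filled rows can then be used together with the new rows to train a single model on all the data.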
If your API needs to support calls both with and without the new feature, you will have to keep both models and invoke the appropriate one depending on which features the incoming data contains.
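A minimal sketch of that routing, assuming each request arrives as a dictionary and each model is a callable; `route_prediction`, `date_of_birth`, and the two stand-in models are hypothetical names used only for illustration.

```python
def route_prediction(request, old_model, new_model, new_feature="date_of_birth"):
    """Send the request to the new model only when the new feature is present."""
    if request.get(new_feature) is not None:
        return new_model(request)
    return old_model(request)


# Stand-ins for real trained models; each returns (model name, score).
def old_model(request):
    return ("old", 0.70)


def new_model(request):
    return ("new", 0.90)


with_dob = route_prediction(
    {"income": 55, "date_of_birth": "1990-04-12"}, old_model, new_model
)
without_dob = route_prediction({"income": 40}, old_model, new_model)
print(with_dob)     # ('new', 0.9)
print(without_dob)  # ('old', 0.7)
```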