As you mentioned, if data drift happens during the inference period, you said you would update the model. How exactly will you update it?
Do you mean you will combine the new data that caused the drift with the data the model was already trained on, and then perform feature engineering again? If so, the new data will not have the target column, right? How will you handle that? Will it be good ground truth data?
Correct me if I am wrong.
Hi @optimizing_wieghts,
I can relate to your question. Combining both datasets makes sense if the old data distribution is still relevant to the problem statement. For example, speech recognition may still be required for adult voices in addition to young voices. And yes, the new dataset has to be labeled before it is used to retrain the model. There could be smarter ways to label some of it, but those will be application dependent.
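To make that concrete, here is a minimal sketch (my own illustration, not a prescribed implementation) of retraining on the combination of the original training set and newly labeled production data. The file names, the `price` target column, and the choice of `RandomForestRegressor` are all placeholders, not something from the thread:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical file names, for illustration only.
old_df = pd.read_csv("original_training_data.csv")       # already labeled
new_df = pd.read_csv("new_production_data_labeled.csv")  # labeled after collection

# Combine old and newly labeled data, then redo feature engineering on the
# combined frame so both slices are transformed consistently.
combined = pd.concat([old_df, new_df], ignore_index=True)

X = combined.drop(columns=["price"])  # "price" is the assumed target column
y = combined["price"]

# Retrain from scratch on the combined dataset. This is one simple strategy;
# warm-starting or up-weighting recent data are common alternatives.
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X, y)
```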
@rajgupt thanks very much for your quick turnaround. Would you mind giving me an example scenario with structured data? Say we are going to predict house prices. I have trained the model and deployed it to the production environment. Assume we are doing batch prediction at a weekly interval. Before retraining the model, I verify the distribution of the target column the model was trained on against the predicted column from the new data. If I notice a significant difference between the distributions, I would prefer to retrain. But here is the question: whatever we have predicted is not an accurate prediction, so how can we use that data to retrain the model?
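To make the check I described concrete, something like this is what I have in mind (a rough sketch with hypothetical file names; a two-sample Kolmogorov-Smirnov test is just one common way to flag a shift between the training targets and the latest batch of predictions):

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical inputs, for illustration only:
#   y_train    - target column ("price") from the original training data
#   y_pred_new - model predictions on this week's unlabeled batch
y_train = np.loadtxt("train_prices.csv")
y_pred_new = np.loadtxt("weekly_predictions.csv")

# Two-sample KS test: a small p-value suggests the two distributions differ,
# which would be treated here as a drift signal that triggers retraining.
statistic, p_value = ks_2samp(y_train, y_pred_new)

ALPHA = 0.05  # arbitrary significance level for illustration
if p_value < ALPHA:
    print(f"Possible drift (KS={statistic:.3f}, p={p_value:.3g}); consider retraining.")
else:
    print("No significant shift detected in the prediction distribution.")
```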
Actually, I may have confused you with my approach. Could you please let me know your approach to updating the model? What would be a smarter way to label?