Hi
I have been working on a project where I have to forecast a value every week for each product. I am seeing data drift (a shift in the data distribution). Would you recommend setting up automatic re-training so that the model picks up new data every time it comes in? That way we would keep getting acceptable forecasts. If you have other suggestions, please do share them with me.
Hi p7137, welcome to the community.
I think that’ll work, but it might not be very efficient. Often, when deploying a model, we need to balance a metric like accuracy against non-functional requirements like the number of predictions that can be made per unit time.
Training takes far longer than inference, so triggering a training run every time new data comes in will significantly slow down your pipeline.
Alternatively, you could batch the data first, say collect ‘n’ new samples before triggering a training run, or simply monitor the statistical properties of the incoming data. For example: has the mean changed over the last ‘n’ samples? Has the variance changed (compared to the last training set)? If some of these statistical properties have shifted, go ahead and trigger a re-training; a minimal sketch of this idea is below.
The ‘n’ mentioned above can be anything like 100, 1000, etc., depending on the use case or the rate of incoming data points.
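As a rough illustration, here is a minimal sketch of that monitoring logic in Python. The function name `drift_detected` and the thresholds `mean_tol`/`var_tol` are assumptions made up for this example, not from any particular library, and would need tuning; a production setup might prefer a proper statistical test (e.g., Kolmogorov-Smirnov) over raw mean/variance comparisons.

```python
import numpy as np

def drift_detected(train_data, recent_data, mean_tol=0.10, var_tol=0.25):
    """Flag drift if the mean or variance of the last 'n' samples has moved
    beyond a relative threshold compared to the training set.

    mean_tol / var_tol are placeholder relative-change thresholds (10% / 25%);
    pick values appropriate for your products and data rate.
    """
    train_mean, train_var = np.mean(train_data), np.var(train_data)
    recent_mean, recent_var = np.mean(recent_data), np.var(recent_data)

    mean_shift = abs(recent_mean - train_mean) / (abs(train_mean) + 1e-9)
    var_shift = abs(recent_var - train_var) / (train_var + 1e-9)
    return mean_shift > mean_tol or var_shift > var_tol

# Usage: collect n new samples, then re-train only if the distribution moved.
n = 1000
train_data = np.random.normal(100, 5, size=10_000)  # stand-in for your training set
recent_data = np.random.normal(108, 5, size=n)      # stand-in for incoming samples

if drift_detected(train_data, recent_data):
    print("Drift detected - trigger re-training")
```

The point of the ‘n’-sample buffer is that re-training happens at most once per ‘n’ points, and only when the statistics have actually shifted, instead of on every new sample.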
Thanks for your response
I have one more question. If the actual value is 120, the predicted value comes out around 116. I have been noticing this trend over a couple of weeks of data: the predicted values deviate by about ±5 from the actual values for some products, and by about ±1 for other kinds of products. Would you consider this a good forecasting model, or is there a need to consider other hidden factors? Should I do some feature engineering on the data points here?
Your suggestions would be appreciated
Thanks
Hi p7137,
I sincerely apologize for such a delayed response.
As with any analytics project, we should first fix the target metric (see rules #2 and #13 in the Rules of ML).
This should ideally be done before we start processing the data. The metric could be something like: the model prediction should always be within 5% of the true value. The acceptable range might come from a domain expert or even from a regulatory document.
Once the metric is defined, it will be straightforward to say whether the predictions you are seeing are good enough or need more work. Maybe a ±5 deviation is good enough, maybe not (it’s very problem-specific); the metric will let you conclude that. For instance, ±5 on an actual value of 120 is about a 4% error, which would pass a 5% tolerance, whereas ±5 on an actual value of 50 would be a 10% error and fail it.
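To make that concrete, here is a minimal sketch of such a check, assuming the “within 5% of the true value” metric above. The function name `within_tolerance` and the pass criterion are placeholders for this example; the actual tolerance should be fixed with your domain expert.

```python
import numpy as np

def within_tolerance(y_true, y_pred, rel_tol=0.05):
    """Return the fraction of predictions within rel_tol of the actual value."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rel_err = np.abs(y_pred - y_true) / np.abs(y_true)
    return float(np.mean(rel_err <= rel_tol))

# Example from this thread: actual 120, predicted 116 -> ~3.3% error, within 5%.
print(within_tolerance([120], [116]))          # 1.0
print(within_tolerance([50, 120], [45, 116]))  # 0.5 (a ±5 miss on 50 is a 10% error)
```

You could compute this per product and per week; a value below 1.0 (or below whatever pass rate you agree on) tells you which products need more work, e.g. extra feature engineering.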