Hi,
I would like to know whether using a centered moving average to predict new values counts as data leakage. Since we average each point of the series with values from both the past and the future, we are using information that is not available at prediction time.
Thank you
Hi, @Francisco_Pereira !
That’s true. At inference time you just can’t do that. You can only use past values when averaging at the current point.
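To make the difference concrete, here is a minimal sketch in plain Python. The function names, the toy series, and the choice of an odd window of 3 are all just assumptions for illustration, not anything taken from the course notebooks.

```python
def centered_moving_average(series, window):
    """Average each point with (window // 2) neighbours on each side.

    Assumes an odd window. Needs future values, so it is only usable
    offline (for example, for smoothing historical data).
    """
    half = window // 2
    out = []
    for t in range(len(series)):
        if t - half < 0 or t + half >= len(series):
            out.append(None)  # not enough neighbours on one side
        else:
            chunk = series[t - half : t + half + 1]
            out.append(sum(chunk) / len(chunk))
    return out


def trailing_moving_average(series, window):
    """Average each point with the (window - 1) values before it.

    Uses only past and present values, so it can be computed at
    inference time.
    """
    out = []
    for t in range(len(series)):
        if t - window + 1 < 0:
            out.append(None)  # not enough history yet
        else:
            chunk = series[t - window + 1 : t + 1]
            out.append(sum(chunk) / len(chunk))
    return out


# Toy data: the last value is a spike that lies in the "future" of index 3.
series = [1.0, 2.0, 3.0, 4.0, 100.0]
centered = centered_moving_average(series, window=3)
trailing = trailing_moving_average(series, window=3)
```

At index 3, the centered average already includes the spike at index 4, while the trailing average at index 3 only sees the values 2.0, 3.0 and 4.0. That is exactly the information you would not have when predicting.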
Hi @Francisco_Pereira,
welcome to the community and thanks for your question!
A moving average filter is called causal if its output depends only on past or present inputs; see Section 8.4.3.
This is usually the case when it comes to forecasting algorithms.
Take an ARIMA approach, for example. For time series prediction, it is fair to reason as follows:
- The prediction horizon is relevant. E.g. if you make a prediction today for July 2023, the prediction horizon is ~6 months.
- A fair benchmark for this prediction horizon should be considered (see also this thread).
- When you evaluate your model performance, i.e. not before 6 months have passed (let’s call the data collected by then our test set), preferably also against the benchmarks, there is no data leakage, since no information from the test labels was available when the prediction was made.
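The points above can be sketched roughly as follows. This is only an illustration under stated assumptions: the monthly values are made up, the helper names are hypothetical, and a naive "repeat the last value" forecast stands in for the benchmark (an ARIMA model would be fitted on `train` in the same way and scored against the same `test`).

```python
def split_by_horizon(series, horizon):
    """Hold out the last `horizon` points as the test set."""
    return series[:-horizon], series[-horizon:]


def naive_forecast(train, horizon):
    """Benchmark forecast: repeat the last observed training value."""
    return [train[-1]] * horizon


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up monthly observations; the last 6 are unknown when the forecast is made.
monthly = [10.0, 12.0, 11.0, 13.0, 14.0, 15.0,
           16.0, 18.0, 17.0, 19.0, 20.0, 22.0]

train, test = split_by_horizon(monthly, horizon=6)

# The forecast is built from `train` only, so no test labels can leak into it.
benchmark = naive_forecast(train, horizon=6)
benchmark_mae = mean_absolute_error(test, benchmark)
```

Any model you compare against this benchmark should likewise see only `train` when producing its 6-month forecast.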
Best regards
Christian
Thank you for the answer. From what I’ve seen in the first week of the Sequence Models course, the instructor uses a centered rolling window. So we can assume that this would not be possible in a real-world application, since we do not have access to future data. Tell me if I’m missing something.
Thanks
Exactly. If you were to deploy that model, you simply could not use future data, as it does not exist yet.