Anomaly prediction using time series

Hi Everyone

I am working on a real-world scenario to predict anomalies in a production application. For that we are feeding resource metrics (CPU/memory/error rate/traffic/latency) to the model and checking if there are anomalies.

What would be the best way forward? Since we have the data in BigQuery, we thought of choosing ARIMA_PLUS. Do you think we are heading in the right direction, or could LSTMs or RNNs get us better results? How should we approach it?

Hello @Bot001,

This is my first time reading about ARIMA+, but based on my understanding of ARIMA and figure 1 of this paper, I think the largest gain of RNN models is that you won’t be limited to considering your features (metrics) separately. A big advantage of neural networks is that the features are aggregated and projected non-linearly to some new dimension that correlates better (than each original feature) with your predicted variable.

If I were you, with my data already on Google Cloud and already being familiar with ARIMA+, I would just go with ARIMA+ and see if the result was satisfying. After all, even if I were to try a neural network, the ARIMA+ result would serve as a good baseline for my iterative process of improving the NN. I would also be interested in whether the components (see figure 1) distilled by ARIMA+ might serve as an easier or cleaner input to train a smaller network, while also staying very alert to whether the residuals might contain information that I could miss by neglecting them.

Cheers,
Raymond

3 Likes

Raymond’s answer is profound; I can’t add much more depth here. But if the data is arranged in tabular sources, then ensemble machine learning algorithms might offer similar or even better performance than neural networks at a smaller scale of resources, from what I have read.

1 Like

Thanks @rmwkwok and @gent.spah

I found a critical limitation in using ARIMA_PLUS: it always forecasts from the end of the latest training data. So before each prediction I need to train my models again and again, which does not seem feasible and is too costly, because I need to make a prediction every 30 minutes.

I got below recommendations from Claude

——————————–
Switch to Python ARIMA :-

New architecture:

  • Train weekly using Python ARIMA (not BigQuery ML)
  • Save models to GCS
  • At prediction time: Load model → Append recent 48h data → Forecast from NOW
  • Cost: ~$9/month
  • Effort: 7-9 hours (new code)

Advantages:

  • :white_check_mark: Forecast from current time (solves core problem!)
  • :white_check_mark: Weekly training (cheap)
  • :white_check_mark: True “30 min ahead” predictions

Disadvantages:

  • :cross_mark: Lose ARIMA_PLUS auto-seasonality features
  • :cross_mark: Need to manually tune SARIMA parameters
  • :cross_mark: More code to maintain

Is there any better approach, or other recommendations?

1 Like

Hello, @Bot001,

I wonder if you have access to the ARIMA+ trained parameters, because if you can reconstruct the forecasting model, you may as well “append the recent 48h data” while leveraging ARIMA+.


Source: the same paper

If I were you, the number one priority would be to verify the reconstructed model. It’s an interesting exercise anyway.

Besides, it would actually be very important to test the stationarity of the system that the model is describing, because that decides how up to date the model has to be, even though you can append as much data as you like to your reconstructed model.

Cheers,
Raymond

Thanks so much @rmwkwok for taking your valuable time on it.

We went through your suggested approach and tried some things at our end. We encountered some situations and need your suggestions on the doubts.

Attaching some findings. Could you please take some time and help us through them.

Thanks so much in advance. (I was not able to upload an .md file, so I converted it to .txt; please feel free to rename it for a better look.)

investigation.txt (16.1 KB)

Hello, @Bot001,

Thank you for the detailed work and analysis.

Let me start by explaining what my last message meant, and it’s simple: reconstruct the maths that produces the forecast result. The message is simple, but the work can take some time. For example, the paper says they used “Seasonal and Trend decomposition by Loess” (STL) for extracting seasonal (and trend) components, which raises the question “can we reproduce this algorithm?” Your Markdown shows that you downloaded some parameters (e.g. p, d, q), but STL is a non-parametric algorithm, which means you may not be able to just download it; you have to implement it.

However, from your work, I see you were trying to get straight to your demand, so let us not bother with the maths for the time being, and let me share one thing I found with the help of the Google AI Mode.

It told me that we can export the ARIMA+ model in TensorFlow format. If so, you can run the model anywhere, including on your computer. You will need to check the model’s input signature for what it accepts, but from what I understand, you will have to provide three things:

  1. the horizon - number of future points to forecast
  2. the time series data
  3. an ISO-format timestamp that represents the start time of your time series.

Number two is the most interesting part because, in principle, you can include post-training data, and I believe this is your goal. My understanding is that the exported model should contain everything, including the STL algorithm and trained parameters in a frozen state, so I believe the p, d, q values will be embedded there as well. However, I didn’t test any of this because I have not used ARIMA+, but I believe it won’t be too difficult for you to find the instructions and test these assumptions, because, after all, you can also ask the Google AI Mode :wink: .
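Checking the input signature of a SavedModel is straightforward once you have it. The toy module below just stands in for the real export (I have not tested the ARIMA+ export itself, and its actual signature names may differ), but the inspection pattern is the same:

```python
# Minimal sketch: save a toy module, reload it, and inspect its signature --
# the same check you would run on an exported ARIMA+ SavedModel.
import tempfile
import tensorflow as tf

class ToyForecaster(tf.Module):
    @tf.function(input_signature=[
        tf.TensorSpec([], tf.int32, name="horizon"),
        tf.TensorSpec([None], tf.float32, name="values"),
        tf.TensorSpec([], tf.string, name="start_timestamp"),
    ])
    def forecast(self, horizon, values, start_timestamp):
        # Naive stand-in "forecast": just return the last observed value.
        return {"forecast": values[-1:]}

module = ToyForecaster()
path = tempfile.mkdtemp()
tf.saved_model.save(module, path,
                    signatures={"serving_default": module.forecast})

loaded = tf.saved_model.load(path)
sig = loaded.signatures["serving_default"]
print(sig.structured_input_signature)  # lists the three expected inputs
```

On a real export you would replace the toy path with the exported model directory and read off the expected input names and dtypes from `structured_input_signature` before calling it.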

If all of these work out, then we can evaluate how well an “old” model performs and decide how often to retrain.

Now I think we can briefly go back to the maths, though I also have to say that, with all the above working out, there is no need to care about it at all. However, a good understanding of the maths will definitely help you make decisions, including, at the very least, how to use the components as the input of your other model (RNN or GBDT).

I will quickly go through your questions at the end of the Markdown:

Q1: I believe exporting it to TensorFlow format is worth a try, but you will need to find out if it is possible and, if so, how. Regarding the API: in my conversation with the Google AI Mode I found something that does not show up in your Markdown, but I will leave it to you to decide how useful it is.

Q2: I think you can somehow use them as the input for a model (not necessarily GBDT). For example, ARIMA+ does clean your data by removing outliers, spikes, and so on (for more, see the paper). My idea is that it is very likely easier to train a model with cleaner data. That’s it.
For your usage examples in the question, I don’t intend to just say yes/no to them. The idea here is that any contribution from ARIMA+ to your RNN/GBDT may make the model easier to train, but the degree to which it helps cannot be told in advance, and neither can the tradeoff against the extra cost of running ARIMA+. Therefore, whether and how you use ARIMA+ for your RNN/GBDT is really an iterative process that takes performance and cost into account, and I can’t just predict it here. Another reason I can’t predict it is that I know nothing about the behavior of your data. I hope this makes sense to you.

Q3: The reliability has to be judged by a good understanding of the system that generates the data, by experiment, and by evaluation. For the first part, you need to consult the domain expert of the system: do the seasonal patterns have any reason to be stable? If there is reason to believe they are unstable, it’s unlikely that you can reliably use those from ML.EXPLAIN_FORECAST deep into the future.

Q4: I think this leads us back to my answer to your Q1. Besides, I didn’t really read the whole paper, so you can’t rely on me to give you everything public about ARIMA+. However, I think it’s only going to be useful to have a good understanding of the paper, and we can read it with the help of the Google AI Mode.

Lastly, you mentioned ensemble models and also LightGBM. Comparing LightGBM with an RNN, the “same” part is that ARIMA+ can contribute to both, but the “different” part is that you will need to engineer features for LightGBM, and this part can be challenging.
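To illustrate what that feature engineering means in practice, here is a toy example of the kind of lag and rolling features a GBDT typically needs, built from a raw series (the column names and window sizes are illustrative, not a recipe for your data):

```python
# Toy lag/rolling feature engineering for a tabular (GBDT-style) model --
# the manual work an RNN largely gets "for free" from raw sequences.
import numpy as np
import pandas as pd

cpu = pd.Series(np.arange(10.0), name="cpu")  # stand-in for a metric series
features = pd.DataFrame({
    "lag_1": cpu.shift(1),                  # value one step (30 min) ago
    "lag_2": cpu.shift(2),                  # value two steps ago
    "roll_mean_3": cpu.rolling(3).mean(),   # short rolling average
}).dropna()  # drop rows where lags/windows are not yet defined
```

Each row of `features` is then a self-contained training example for LightGBM, whereas an RNN would consume the windowed sequence directly.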

Cheers,
Raymond

2 Likes

Hello @Bot001,

I have updated my response. Just in case the forum sent you an email with my old response, please ignore that and come to the forum.

Have a good day!

Raymond

Thanks a lot @rmwkwok for taking time for this.

I am going to go over it today after work, but just a hint: the data we are dealing with is the application’s resource usage, traffic pattern, and error pattern, so it’s seasonal, and we are trying to do anomaly predictions based on that :slight_smile:

I will try to go through your response thoroughly and see what learnings I can take from it to help us further. Thanks for being there and helping; I hope you won’t mind me bothering you again on it :slight_smile:

Thanks so much.

Sure, and good luck. If you can really save it in TensorFlow format, it would be really interesting to see how much of its insides can be unveiled.

A quick response:

But then the seasonality is going to depend on the users and the associated environment, which means that whether the seasonality itself is stable is still an open question to me :wink:

For example, can your user base change? can the environment change?

The stability of the seasonality would be my question.

Cheers.

Mostly the seasonality here is driven by festivals and weekends, as this is a retail site, and what we are talking about are actual live applications. Users will keep increasing (we hope :slight_smile: ) and the usage will increase gradually too, but we are trying to find whether there are anomalies, so that we can raise incidents and they can be checked beforehand.

Thanks so much.

1 Like

I see. I can only imagine the many influencing factors around retail, from market and economics to logistics and infrastructure. It’s a sensible strategy to limit the scope to resource metrics, but you are still working on an interesting and challenging problem. Good luck!

Try something innovative for anomaly detection in multi-dimensional time series, such as TS-AIDA. Google for: TS-AIDA, University of Utrecht.

@rmwkwok / @Hansbakker

I tried multiple ways, but somehow I am getting inclined towards the TimesFM model, since it won’t need re-training (which we otherwise mostly need before predictions to get accurate results).

And since we are working with BigQuery, it seems more aligned.