Why is validationMAE not following the downward trend of validationLoss?


I tried to apply the methodology given to other real data (more than 8’000 daily values).

Naive forecast
The naive forecast obtained on those data is MAE = 6.08

Best model
After trying more than 100 model/hyperparameter combinations, the best model I came up with (see the Jupyter notebook) gives MAE = 5.68 when run for 1000 epochs.

I am not satisfied, as I was hoping that I could reach a much lower MAE when compared to the naive forecast. This raises a number of questions.


  1. Why is the validationMAE not following the drop in validationLoss?
    1.1) Hybrid with 1 unidirectional LSTM → 100 epochs = MAE: 5.72

    1.2) Hybrid with 1 bidirectional LSTM → 100 epochs = MAE: 5.77

    1.3) Hybrid with 2 bidirectional LSTM → 100 epochs = MAE: 5.82

  2. What is the best loss and validationLoss shape we should strive for?
    Should they be almost parallel and smooth like in 1.3?
    But those gave a worse MAE…

  3. What could be done to improve performance ?
    The best model
    0_bestTSModel.ipynb (85.1 KB)
    run on 1000 epochs has MAE = 5.68 (compared to naive forecast with MAE 6.08).
    It seems to be overfitting; my regularization trials helped a bit, but not that much.
    I cannot increase the amount of input data in the time series, which is already > 8000.

  4. Should normalization be attempted ?
    The values are integer count data ranging from 0 to 70.

Thank you for any suggestions or thoughts,

Hi Raymond @rmwkwok !
How are you doing?
Would you have any suggestions or thoughts about this?

Hello Manu, good to talk to you again.

  • My observation is that there is a consistent gap between the training curve and the validation curve in both the loss plot and the MAE plot. This may be because there is a systematic difference between your training and validation data, which is obvious from your notebook’s time series plot. Given that the gap is very consistent, you might consider artificially adjusting the predictions by a constant so that they match again.
  • On the other hand, it is not very clear, but in the MAE plot the validation curve seems to be dropping along with the training curve over epochs; it is just that the fluctuation in the validation curve is too large for the trend to be clear. I can’t conclude what’s wrong just by looking at these plots, but you may keep everything else unchanged and dramatically reduce the LSTM size (e.g. from 180 to 45) just to see whether the fluctuation becomes smaller.
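The constant-offset adjustment described above can be sketched like this (the arrays are purely illustrative, not taken from the notebook):

```python
import numpy as np

# Illustrative numbers only: validation targets and predictions
# that sit above them by a roughly constant gap.
y_val = np.array([10.0, 12.0, 9.0, 11.0])
y_pred = np.array([13.2, 15.1, 12.0, 14.3])

# Estimate the systematic gap as the mean residual on the validation set...
offset = np.mean(y_pred - y_val)

# ...and subtract it from the predictions so the curves match again.
y_pred_adjusted = y_pred - offset

print(np.mean(np.abs(y_pred - y_val)))           # MAE before adjustment
print(np.mean(np.abs(y_pred_adjusted - y_val)))  # MAE after adjustment
```

This only helps when the gap really is a constant shift; if it drifts over time, a constant correction will not be enough.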
  1. In my opinion, all curve pairs in 1.1, 1.2, 1.3 are moving together. 1.3 looks smoother probably just because its y-scale is larger.

  2. Before talking about this, try normalization. Also, could you give a brief introduction to the problem being solved? Given a sample of 8000 time steps, are you trying to predict the value at the 8001st time step?

  3. I suggest you always normalize, because gradient descent is sensitive to this.
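For count data known to lie in a fixed range (0 to 70 was mentioned in question 4 above), a simple min-max scaling is one option. This is a minimal sketch, not the notebook’s actual code:

```python
import numpy as np

# Daily counts as floats (0 to 70, per the question above)
counts = np.array([0, 3, 17, 70, 42], dtype=np.float32)

# Scale to [0, 1] using the known range; keep the constants
# so model outputs can be mapped back to real counts.
lo, hi = 0.0, 70.0
scaled = (counts - lo) / (hi - lo)

# Inverse transform for interpreting predictions
restored = scaled * (hi - lo) + lo
```

One design point: compute the scaling constants from the training set only (or from known physical bounds, as here), and reuse them unchanged on the validation set.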



Hi Raymond @rmwkwok,

Thank you very much for your great and helpful feedback.

A: More information on project and objectives
A1: Contiguous data
My final objective is:
Inputs: historical data, daily integer counts of observations of species (multivariate)
Outputs: Single day prediction for each species (multivariate with single time step)

My actual objective is:
To study the data and lessen complexity in a first phase. The model I made here in the notebook is a univariate input with the corresponding univariate single step output.

A2: Regression is favored
Only endogenous variables are considered (the prediction depends on past observations t-1, …).
Exogenous variables (other types of variable that may be used for prediction) will not be considered in this first phase.

A3: Structured data / time window
Trend and seasonal cycles are present.
Intuitively, we might think that a time window of 365 days would be more appropriate, but my tests gave the best results with a time window of 180 days.

A4: Static vs dynamic
I am still thinking about the best way to do it.
Maybe train a model every month (based on a 180-day time window) and use this model for 4 weeks to make predictions every day based on the last 180 observations? That would mean retraining the model once a month, as it might take quite a lot of computing resources to train it more often. Were you thinking of a better way to do it?

A5: Train vs validation set
I have 8’000 observations at my disposal, divided into a training set and a validation set. To make sure that the validation set contains a full seasonal cycle, I looked at the plot and put the threshold (red vertical line) at time step 6’500. Based on this, my trainSet is obs[0:6500] and my validSet is obs[6500:]. Would you see it another way?

Raymond @rmwkwok
You found the problem, awesome! It was indeed as you said, a matter of normalizing the data beforehand. You can see the results here if you want:
1_TSModelWithNorm.ipynb (215.5 KB)

So that solves my main problem perfectly, I will now have to redo all my tests again from the beginning to optimize the network.

Thanks a lot Raymond!

  1. Perhaps we should check this: with a trained model, make many predictions. Group the predictions for the first day of the 4 weeks and calculate an MAE; another MAE for the 2nd day; and so on. See if the MAE goes up over the 4 weeks.
    You will end up with 7*4 = 28 MAEs; plot them. If you care not just about the MAE but also about its fluctuation among the first days, among the second days, and so on, you may make a second plot showing only the standard deviation of the absolute errors on each day of the 4 weeks.
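This per-day check could be sketched as follows, assuming the forecasts have been collected into arrays of shape (n_windows, 28), one row per 4-week window; the data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 100 forecast windows of 28 days each,
# with error that grows over the forecast horizon.
n_windows, horizon = 100, 28
y_true = rng.normal(size=(n_windows, horizon))
y_pred = y_true + rng.normal(scale=np.linspace(0.5, 2.0, horizon),
                             size=(n_windows, horizon))

abs_err = np.abs(y_pred - y_true)
mae_per_day = abs_err.mean(axis=0)   # one MAE per day of the 4 weeks
std_per_day = abs_err.std(axis=0)    # spread of absolute errors per day

# A matplotlib plot of the two curves would then show how the error
# grows (or not) over the horizon:
# plt.plot(mae_per_day); plt.plot(std_per_day)
```

If the MAE curve rises steeply across the 28 days, that argues for retraining (or re-anchoring) the model more often than once a month.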

I think this is good!

Some comments / questions:

  1. If I understand the numbers correctly, your predicted values ranged between 0 and 1. Your MAE is at the level of 0.0004, which is roughly 0.0001/0.1 ~ 0.1% error, and that is (please bear with me) suspiciously too good. I guess this is because you are predicting the 181st day from the previous 180 days of data and the day-to-day fluctuation isn’t too large (so the model may rely on the 180th day to predict the 181st). You may see a different MAE size if you do the above check (0).

  2. In the time series, there are zero levels in each yearly cycle, and the timing of the zeros seems pretty consistent over all the years. It looks like you are not handling the zeros specially. Are those zeros observations of zero, or are they missing values? If they are missing, do you expect to make predictions in those periods? If so, this can be challenging, but let’s talk about it if you really expect to.

  3. If and when you do (0), it would be useful to make one additional plot with two time series: observations and predictions. According to your current plan, you predict 4 weeks based on the previous 180 days, so over a year there will be 12 predictions covering the whole year (12x4 weeks). I would make the 12 predictions, concatenate them into a prediction time series, and plot it together with the observations over the same period. This may reveal a lot of interesting behavior/problems of your model.
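The stitching could be sketched like this. The model call is a placeholder (here it just repeats the last observed value) and the series is synthetic; the 180/28 window sizes come from the thread above:

```python
import numpy as np

window, block = 180, 28                 # look-back and 4-week block
obs = np.sin(np.arange(600) / 30.0)     # stand-in observation series

def predict_block(history):
    """Placeholder for the trained model: repeat the last value."""
    return np.full(block, history[-1])

# Slide over the series in 4-week steps, concatenating each block
preds = []
start = window
while start + block <= len(obs):
    preds.append(predict_block(obs[start - window:start]))
    start += block
pred_series = np.concatenate(preds)

# Overlay with the matching slice of the observations:
# plt.plot(obs[window:window + len(pred_series)], label="observation")
# plt.plot(pred_series, label="prediction")
```

With a real model, `predict_block` would be replaced by the trained network’s multi-step forecast on each 180-day history.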



Hello Raymond @rmwkwok,

Answers to your points:

  1. Great, indeed, I will do it this way; this will give me a clear indicator of when the model should be retrained.

  2. Model suspiciously too good. I totally agree. I am struggling to understand what is going on behind the scenes after applying normalization.
    As you will see in the notebook attached:
    1_compare.ipynb (357.5 KB)
    the model without normalized data does quite well when its predictions are projected onto the real validation data (section 1.2). But when the data are normalized, my usual tool to optimize SGD does not work anymore (section 2.2.1). So I used Adam instead, and when training the model, you see in the verbose output that the drop in MAE is immediate, already after the 1st epoch. Finally, when the predictions are projected onto the real validation data, the normalized model just gives a straight line. Any thoughts about this?

  3. The zeros are real values meaning absence. They are not missing values. They need to be used as such, and in winter you may have lots of 0.

  4. Plot observation vs prediction. I integrated those plots into the notebook. Absolutely, thank you, I will make these plots as soon as I have solved the issues mentioned in 1.

Thanks Raymond

Hello Manu,

Maybe change “plt.semilogx” to “plt.semilogy”, and don’t set the ranges for x and y at first?

Maybe because tf.int16 turns those normalized numbers to zero?

 ds = ds.map(
        lambda X, Y: (tf.cast(X, tf.int16), tf.cast(Y, tf.int16)))  # tf.int16 is a signed 16-bit integer; the cast truncates the normalized fractions to 0
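The effect is easy to see outside TensorFlow too; here is a NumPy illustration of why an integer cast destroys normalized values (`tf.cast` to `tf.int16` truncates the same way):

```python
import numpy as np

# Normalized targets all lie in [0, 1)...
x = np.array([0.0004, 0.73, 0.999], dtype=np.float32)

# ...so casting to a 16-bit integer truncates every value to 0,
as_int = x.astype(np.int16)
print(as_int)          # [0 0 0]

# while a float cast keeps them intact.
as_float = x.astype(np.float32)
print(as_float)
```

So after normalization, the dataset pipeline should cast to `tf.float32` (or just leave the floats alone), not to an integer type.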

Now that I see you overlay prediction on observation, I think maybe it’s better to overlay the error (prediction - observation) and the observation on the same plot.



Hi Raymond @rmwkwok,
Thank you very much for your feedback. I am on vacation and will be away from my computer for 3 weeks; I look forward to trying those paths when I am back.
I didn’t feel smart when I read your comment on tf.int16; I should have seen that myself, sorry.
Take care and talk soon!

Hello Manu,

Forget about tf.int16, now it’s time to enjoy the vacation :slight_smile: 3 weeks is awesome! Eat well, drink well, rest well, play well :slight_smile: , and a lot of great experiences are coming.