Hi team,
Thank you for the wonderful course. I have a question regarding the train/eval split on the time-series dataset.
In the lab, we use ExampleGen to do the train/eval split. Doesn't this break the continuity of the timesteps?
Say the problem is to use HISTORY_SIZE = 2 (x_{t-1}, x_t) to predict FUTURE_TARGET = 1 (y_{t+1}).
And we have a time series with 9 data points [(x0, y0), (x1, y1), …, (x8, y8)], where x_i is the feature vector at time index i and y_i is the label at time index i.
Then we get something like
train: [(x0, y0), (x1, y1), (x2, y2), (x5, y5), (x7, y7), (x8, y8)]
val: [(x3, y3), (x4, y4), (x6, y6)]
Then in a training batch, we can get a datapoint like this
training datapoint (feature list, target): ([x1, x2], y5)
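To make my concern concrete, here is a rough sketch of what I mean (the random index assignment and the windowing loop are my own illustration, not the notebook's actual code):

```python
# Hypothetical random split: suppose ExampleGen assigns these time
# indices to the train set (index i stands in for (x_i, y_i)).
train_idx = [0, 1, 2, 5, 7, 8]

HISTORY_SIZE = 2   # use (x_{t-1}, x_t) ...
FUTURE_TARGET = 1  # ... to predict y_{t+1}

# Naively sliding a window over the *already split* train list
# pairs non-contiguous timesteps together:
windows = []
for i in range(len(train_idx) - HISTORY_SIZE - FUTURE_TARGET + 1):
    history = train_idx[i : i + HISTORY_SIZE]
    target = train_idx[i + HISTORY_SIZE + FUTURE_TARGET - 1]
    windows.append((history, target))

print(windows)
# includes ([1, 2], 5), i.e. the pair ([x1, x2], y5) above
```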
Does this make sense? Or did I misunderstand something?
I am confused about whether this is the right way to split the train/eval data, especially when we are preparing the data to train an LSTM network, as stated in the notebook.
Shouldn’t we do it like this?
train: [(x0, y0), (x1, y1), (x2, y2), (x3, y3), (x4, y4), (x5, y5)]
val: [(x6, y6), (x7, y7), (x8, y8)]
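In code, the chronological split I have in mind would be something like this (a minimal sketch; the 2/3 ratio is just for this toy example):

```python
# Stand-in for the 9 time-indexed points (x0, y0) ... (x8, y8).
series = list(range(9))

# Chronological split: earliest points for training, latest for validation,
# so every training window stays contiguous in time.
split_point = int(len(series) * 2 / 3)  # 6
train, val = series[:split_point], series[split_point:]

print(train)  # [0, 1, 2, 3, 4, 5]
print(val)    # [6, 7, 8]
```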
But then again, this chronological split might introduce a distribution shift between train and val, because the features/targets themselves might have trends that vary over time.
Best Regards,
surfii3z