Hi team,

Thank you for the wonderful course. I have a question regarding the train/eval split on the time series dataset.

In the lab, we use ExampleGen to do the train/eval split. Doesn't this break the continuity of the timesteps?

Say the problem is to use HISTORY_SIZE = 2 (**x_{t-1}, x_t**) to predict FUTURE_TARGET = 1 (y_{t+1}).

And we have a time series dataset with 9 data points [(**x0**, y0), (**x1**, y1), …, (**x8**, y8)], where **x_i** is the feature vector at time index i and y_i is the label at time step i.

Then we get something like

train: [(**x0**, y0), (**x1**, y1), (**x2**, y2), (**x5**, y5), (**x7**, y7), (**x8**, y8)]

val: [(**x3**, y3), (**x4**, y4), (**x6**, y6)]

Then, when windowing within the training split, a batch can contain something like this:

training data point (feature list, target): ([**x1**, **x2**], y5)
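If my understanding is right, a minimal sketch of the issue would look like this (the index list and windowing loop below are hypothetical illustrations, not the notebook's actual code):

```python
HISTORY_SIZE = 2

# Hypothetical time indices that landed in the train split after a random split
train_indices = [0, 1, 2, 5, 7, 8]

# Naive sliding window over the split, treating its rows as if they were contiguous
for i in range(len(train_indices) - HISTORY_SIZE):
    history = train_indices[i : i + HISTORY_SIZE]
    target = train_indices[i + HISTORY_SIZE]
    print(f"features: x{history} -> target: y{target}")
# prints e.g. features: x[1, 2] -> target: y5, so the timestep continuity is broken
```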

Does this make sense? Or did I misunderstand something?

I am confused about whether this is the right way to split the train/eval data, especially when we are preparing the data to train an LSTM network, as stated in the notebook.

Shouldn’t we do it like this?

train: [(**x0**, y0), (**x1**, y1), (**x2**, y2), (**x3**, y3), (**x4**, y4), (**x5**, y5)]

val: [(**x6**, y6), (**x7**, y7), (**x8**, y8)]
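For example, a chronological split followed by windowing could look like this (a rough sketch with placeholder arrays and a hypothetical make_windows helper, assuming the same HISTORY_SIZE/FUTURE_TARGET as above; not the notebook's code):

```python
import numpy as np

HISTORY_SIZE = 2
FUTURE_TARGET = 1

x = np.arange(9)        # stand-ins for x0 ... x8
y = np.arange(9) * 10   # stand-ins for y0 ... y8

split = 6               # first 6 timesteps for train, last 3 for eval
x_train, y_train = x[:split], y[:split]
x_val, y_val = x[split:], y[split:]

def make_windows(features, labels, history_size, future_target):
    """Slide a window over a contiguous series to build (history, target) pairs."""
    examples = []
    for i in range(history_size, len(features) - future_target + 1):
        examples.append((features[i - history_size : i], labels[i + future_target - 1]))
    return examples

train_examples = make_windows(x_train, y_train, HISTORY_SIZE, FUTURE_TARGET)
# first example: (array([0, 1]), 20), i.e. ([x0, x1], y2) -> continuity preserved
```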

But then again, for this case we might introduce distribution shift between train and val, because the features/targets themselves might have trends that vary over time.

Best Regards,

surfii3z