# [C2_W4_Lab_1_WeatherData] Question about train/val split for time series data

Hi team,

Thank you for the wonderful course. I have a question regarding the train/eval split on the time series dataset.

In the lab, we use ExampleGen to do the train/eval split. Doesn’t this break the continuity of the time steps?

Say the problem is to use HISTORY_SIZE = 2 past steps (x_{t-1}, x_t) to predict FUTURE_TARGET = 1 step ahead (y_{t+1}).

And we have a time series dataset with 9 data points [(x0, y0), (x1, y1), …, (x8, y8)], where x_i is the feature vector at time index i and y_i is the label at time step i.

Then we get something like

train: [(x0, y0), (x1, y1), (x2, y2), (x5, y5), (x7, y7), (x8, y8)]

val: [(x3, y3), (x4, y4), (x6, y6)]

Then a training batch can contain windows like this

training data point (feature list, target): ([x1, x2], y5)
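To make the concern concrete, here is a minimal sketch in plain Python (not the lab's actual TFX pipeline) of windowing the shuffled train split above. The `make_windows` helper is hypothetical; it simply slides over consecutive positions of whatever array it is given, which is what produces the mixed-time-step pair ([x1, x2], y5):

```python
# Hypothetical windowing sketch: HISTORY_SIZE = 2 past steps are paired
# with the label one step further along in the (already split) array.
HISTORY_SIZE = 2

def make_windows(series):
    """Slide over consecutive positions of `series`, pairing the last
    HISTORY_SIZE feature vectors with the label at the next position.
    Note: this only respects time order if `series` is contiguous."""
    pairs = []
    for i in range(len(series) - HISTORY_SIZE):
        history = [x for x, _ in series[i:i + HISTORY_SIZE]]
        _, target = series[i + HISTORY_SIZE]
        pairs.append((history, target))
    return pairs

# Train split after a random (non-chronological) split: time indices
# 0, 1, 2, 5, 7, 8 survive in order but are no longer contiguous.
train = [("x0", "y0"), ("x1", "y1"), ("x2", "y2"),
         ("x5", "y5"), ("x7", "y7"), ("x8", "y8")]

for history, target in make_windows(train):
    print(history, "->", target)
# One of the printed windows is ['x1', 'x2'] -> y5, which skips
# time steps 3 and 4 entirely.
```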

Does this make sense? Or did I misunderstand something?

I am confused about whether this is the right way to split the train/eval data, especially since we are preparing the data to train an LSTM network, as stated in the notebook.

Shouldn’t we do it like this?

train: [(x0, y0), (x1, y1), (x2, y2), (x3, y3), (x4, y4), (x5, y5)]

val: [(x6, y6), (x7, y7), (x8, y8)]
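A minimal sketch of this chronological alternative, assuming a 2/3 train fraction (the fraction here is my illustration, not a value from the lab), would be:

```python
# Chronological split sketch: keep the series in time order and cut
# once, so every window drawn from either side is contiguous in time.
series = [(f"x{i}", f"y{i}") for i in range(9)]

split_at = int(len(series) * 2 / 3)  # assumed 2/3 train fraction -> index 6
train, val = series[:split_at], series[split_at:]

print([x for x, _ in train])  # features for time steps 0..5
print([x for x, _ in val])    # features for time steps 6..8
```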

But then again, this might introduce a distribution shift between train and val, because the features/targets themselves might have trends that vary over time.

Best Regards,

surfii3z

Hi! Welcome to Discourse! Thank you for pointing this out! I think you’re right and the shuffling of the dataset might have affected the periodicity. We’ll investigate this and update the notebook if needed. Thanks again!

Thank you for your kind response

I am looking forward to the clarification.

Best

Surfii3z