[C2_W4_Lab_1_WeatherData] Question about train/ val split for time series data

Hi team,

Thank you for the wonderful course. I have a question regarding the train/eval split on the time series dataset.

In the lab, we use ExampleGen to do the train/eval split. Doesn’t this break the continuity of the timesteps?

Say the problem is to use HISTORY_SIZE = 2 inputs (x_(t-1), x_t) to predict FUTURE_TARGET = 1 step ahead (y_(t+1)).

And we have time series data with 9 data points [(x0, y0), (x1, y1), …, (x8, y8)], where x_i is the feature vector at time index i and y_i is the label at time index i.

Then we get something like

train: [(x0, y0), (x1, y1), (x2, y2), (x5, y5), (x7, y7), (x8, y8)]

val: [(x3, y3), (x4, y4), (x6, y6)]

Then in the training batch, we can have something like this

training datapoints (features list, target list) : ([x1, x2], y5)
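To make the concern concrete, here is a minimal sketch of how such a window could arise. It assumes a simple sliding window over whatever rows end up in the train split; `make_windows` and `is_contiguous` are hypothetical helpers for illustration, not the lab's actual windowing code.

```python
HISTORY_SIZE = 2
FUTURE_TARGET = 1


def make_windows(indices, history_size=HISTORY_SIZE, future_target=FUTURE_TARGET):
    """Slide a window over consecutive rows of a split.

    Returns (feature time indices, target time index) pairs.
    Hypothetical helper, not the notebook's implementation.
    """
    windows = []
    for start in range(len(indices) - history_size - future_target + 1):
        features = indices[start:start + history_size]
        target = indices[start + history_size + future_target - 1]
        windows.append((features, target))
    return windows


def is_contiguous(features, target, future_target=FUTURE_TARGET):
    """True if the features are consecutive timesteps and the target is future_target steps later."""
    consecutive = all(b - a == 1 for a, b in zip(features, features[1:]))
    return consecutive and target - features[-1] == future_target


# Time indices left in the train split after the random split in the example above.
train_indices = [0, 1, 2, 5, 7, 8]

windows = make_windows(train_indices)
for features, target in windows:
    status = "contiguous" if is_contiguous(features, target) else "broken"
    print(features, "->", target, status)
```

Under these assumptions, only the first window ([x0, x1] with target y2) respects the time order; the window ([x1, x2], y5) skips timesteps 3 and 4.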

Does this make sense? Or did I misunderstand something?

I am unsure whether this is the right way to split the train/eval data, especially since we are preparing the data to train an LSTM network, as stated in the notebook.

Shouldn’t we do it like this?

train: [(x0, y0), (x1, y1), (x2, y2), (x3, y3), (x4, y4), (x5, y5)]

val: [(x6, y6), (x7, y7), (x8, y8)]
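A minimal sketch of such a chronological split, assuming the rows are already sorted by time and held in a plain Python list. The `chronological_split` helper and its `n_val` parameter are hypothetical and not part of ExampleGen:

```python
def chronological_split(rows, n_val):
    """Split time-ordered rows without shuffling.

    The earliest rows go to train and the latest n_val rows go to val.
    Hypothetical helper for illustration only.
    """
    if not 0 < n_val < len(rows):
        raise ValueError("n_val must be between 1 and len(rows) - 1")
    cut = len(rows) - n_val
    return rows[:cut], rows[cut:]


# Time indices 0..8 stand in for the pairs (x0, y0) ... (x8, y8).
time_indices = list(range(9))
train, val = chronological_split(time_indices, n_val=3)
print("train:", train)
print("val:", val)
```

With this split, every window built inside train or inside val covers consecutive timesteps, since neither split has gaps.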

But then again, in this case we might introduce a distribution shift between train and val, because the features and the target themselves might have trends that vary over time.

Best Regards,

surfii3z

Hi! Welcome to Discourse! Thank you for pointing this out! I think you’re right and the shuffling of the dataset might have affected the periodicity. We’ll investigate this and update the notebook if needed. Thanks again!

Hi @chris.favila,

Thank you for your kind response.

I am looking forward to the clarification.

Best

Surfii3z