Course 2 of the specialisation introduces the key components of a data pipeline. One of these is ExampleGen, which automatically splits the data into train and eval sets.
But for imbalanced datasets, random splitting is not enough.
Is there any way to use ExampleGen to split data in a stratified manner? Or, more generally, how can stratified splitting be done in a TFX pipeline?
Any expert advice would be appreciated.
Did you find an answer?
I have the same question. Also, what should be done with the 3D input shape that an LSTM expects?
I have not found an answer yet.
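One possible workaround, offered as a sketch rather than a confirmed answer: do the stratified split yourself before the data reaches the pipeline, write the two sets to separate files, and let ExampleGen ingest them as already-split input (as far as I understand, its input configuration accepts pre-split data by file pattern). The helper below is plain Python, not a TFX API; the name `stratified_split`, the `label` key, and the 80/20 ratio are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key, eval_fraction=0.2, seed=0):
    """Split rows into (train, eval) so each label keeps roughly the same share.

    Hypothetical helper for illustration; it is not part of TFX.
    """
    rng = random.Random(seed)

    # Group the rows by their label value.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, eval_rows = [], []
    # Iterate labels in a fixed order so the result is reproducible.
    for label in sorted(by_label, key=str):
        group = by_label[label]
        rng.shuffle(group)
        # Take the same fraction from every label group.
        n_eval = round(len(group) * eval_fraction)
        eval_rows.extend(group[:n_eval])
        train.extend(group[n_eval:])
    return train, eval_rows


# Imbalanced toy data: 10 positives and 90 negatives.
rows = [{"x": i, "label": 1 if i < 10 else 0} for i in range(100)]
train, eval_rows = stratified_split(rows, "label", eval_fraction=0.2)
```

With this toy data both sets keep the 10% positive rate, which a purely random split would not guarantee. The two lists could then be written out to, say, separate train and eval folders for ExampleGen to read.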
I have the same question. Also, it is not clear to me where this train/eval split happens and what its characteristics are: is it an 80/20 split, a 70/30 split, or something else, and where can it be configured? Furthermore, how would this work with time series, where you don't want to shuffle samples randomly? Any information on the topic would be useful.
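On the ratio and time-series questions, a partial and hedged pointer: as far as I know from the TFX ExampleGen guide, the default output split is train:eval = 2:1, and it can be changed through ExampleGen's output configuration by giving each split a number of hash buckets. The sketch below is a toy illustration of that idea in plain Python, not TFX's actual implementation, together with a chronological split that avoids shuffling for time series. The function names, the MD5 hash, and the `t` timestamp key are assumptions made for illustration.

```python
import hashlib


def hash_bucket_split(record_ids, train_buckets=2, eval_buckets=1):
    """Assign each id to train or eval by hashing it into buckets.

    With 2 train buckets and 1 eval bucket the split is roughly 2:1.
    Toy illustration only; not how TFX is implemented internally.
    """
    total = train_buckets + eval_buckets
    train, eval_ids = [], []
    for rid in record_ids:
        digest = hashlib.md5(str(rid).encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % total
        # Buckets below train_buckets go to train, the rest to eval.
        (train if bucket < train_buckets else eval_ids).append(rid)
    return train, eval_ids


def chronological_split(rows, time_key, eval_fraction=0.2):
    """Split time-ordered rows without shuffling: oldest go to train, newest to eval."""
    ordered = sorted(rows, key=lambda r: r[time_key])
    cut = len(ordered) - round(len(ordered) * eval_fraction)
    return ordered[:cut], ordered[cut:]


# Hash-based split: deterministic for a given id, approximate in ratio.
train_ids, eval_ids = hash_bucket_split(range(300))

# Chronological split: every eval row is later than every train row.
series = [{"t": t, "value": t * 0.5} for t in (7, 2, 9, 0, 4, 1, 8, 3, 6, 5)]
train_rows, eval_rows = chronological_split(series, "t", eval_fraction=0.2)
```

A hash-based split is repeatable but effectively random with respect to time, so for time series one option is to split chronologically beforehand and pass the pre-split files to ExampleGen instead.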