Thoughts on improving C2_W4_Lab_2_Signals.ipynb

Even though I realize that this is a dummy example, since I have expertise in HAR (human activity recognition), I would like to propose some alterations to this notebook in order to improve its content (sorry in advance for my long post).

  1. The current train-test split is incorrect. The data points are split randomly, so a data point collected from subject X at timestamp t may land in the train set, while the next one, collected at t+1, may land in the val set. Consecutive samples are highly correlated, so this leaks information between the two sets and inflates the validation metrics. In addition, to avoid per-subject overfitting in HAR, all the activities from a subject should belong to the same set (i.e., a leave-n-subjects-out approach).
    Thus, a split grouped by the users' id is more suitable. The following will work:

from tfx.components import CsvExampleGen
from tfx.proto import example_gen_pb2

# Split on the 'user_id' feature, train:eval = 3:1.
subject_based_split = example_gen_pb2.Output(
    split_config=example_gen_pb2.SplitConfig(
        splits=[
            example_gen_pb2.SplitConfig.Split(name='train', hash_buckets=3),
            example_gen_pb2.SplitConfig.Split(name='eval', hash_buckets=1)
        ],
        partition_feature_name='user_id'))

# Instantiate ExampleGen with the input CSV dataset
example_gen = CsvExampleGen(input_base=_data_root, output_config=subject_based_split)

# Execute the component
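
To illustrate why partitioning on the subject id prevents leakage, here is a minimal pure-Python sketch of the same idea. It is not how TFX implements the split internally: the `assign_split` helper, the use of MD5 as the hash, and the toy rows are all assumptions made for illustration. The point is only that every row of a given subject hashes to the same bucket, so no subject can appear in both sets.

```python
import hashlib

def assign_split(user_id, train_buckets=3, eval_buckets=1):
    """Deterministically map a subject id to 'train' or 'eval' (hypothetical helper)."""
    total = train_buckets + eval_buckets
    # MD5 is only a stand-in for a stable hash; TFX uses its own fingerprinting.
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % total
    return "train" if bucket < train_buckets else "eval"

# Toy rows (user_id, timestamp, activity): 30 subjects, 5 samples each.
rows = [(uid, t, "walking") for uid in range(1, 31) for t in range(5)]

splits = {"train": [], "eval": []}
for row in rows:
    splits[assign_split(row[0])].append(row)

train_users = {r[0] for r in splits["train"]}
eval_users = {r[0] for r in splits["eval"]}

# No subject contributes samples to both sets.
assert train_users.isdisjoint(eval_users)
```

Because the split depends only on the id, samples at t and t+1 from the same subject always end up together. The 3:1 ratio holds only approximately, since it applies to hash buckets rather than to row counts.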

  2. HAR is a multi-class classification task, so the labels should also be one-hot encoded.

  3. In general, tft.scale_to_z_score is more effective than tft.scale_by_min_max, although this change is optional.

  4. Finally, segments containing more than one label should probably be removed rather than labeled by majority voting, to avoid noisy labels.
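
The remaining points can be sketched in plain Python, outside of TFX. Everything here is hypothetical and for illustration only: the label set, the toy samples, the window size, and the helper names are assumptions, and the z-score is computed by hand rather than with tft.scale_to_z_score. In a real pipeline the mean and standard deviation would come from the training set only.

```python
from statistics import mean, pstdev

ACTIVITIES = ["walking", "sitting", "standing"]  # hypothetical label set

def pure_segments(samples, window):
    """Return (values, label) for non-overlapping windows that carry one label."""
    segments = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        labels = {label for _, label in chunk}
        if len(labels) == 1:  # drop mixed-label windows instead of voting
            segments.append(([v for v, _ in chunk], labels.pop()))
    return segments

def z_score(values, mu, sigma):
    """Standardize values with a precomputed mean and standard deviation."""
    return [(v - mu) / sigma for v in values]

def one_hot(label, classes=ACTIVITIES):
    """Encode a class label as a one-hot list."""
    return [1 if label == c else 0 for c in classes]

# Toy signal: (sensor value, activity label) per timestamp.
samples = [(0.1, "walking"), (0.3, "walking"), (0.2, "walking"),
           (0.4, "sitting"), (0.9, "sitting"), (1.1, "sitting")]

segments = pure_segments(samples, window=2)  # the middle window is mixed and dropped

# Statistics over the kept values (stand-in for training-set statistics).
kept = [v for values, _ in segments for v in values]
mu, sigma = mean(kept), pstdev(kept)

dataset = [(z_score(values, mu, sigma), one_hot(label)) for values, label in segments]
```

With a window of 2, the window spanning the walking-to-sitting transition is discarded, and each remaining segment pairs standardized values with a one-hot label vector.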

Kind regards,

Hi Panagiotis! Thank you very much for these pointers and for sharing your expertise on the subject! I think point #1 re: the data split is a critical fix! We will review these and integrate them into the exercise. Will keep you posted. Thanks again!