tf.data.Datasets and TFX

Hello,

I know it was mentioned at some point that the ETL process for training data should be optimized with tf.data.Datasets so the CPU and GPU don't sit idle (something I've done before locally), but I haven't seen this covered anywhere in the course. The course also didn't seem to touch much on how to use the Trainer component of TFX. So my question is: is ETL optimization something that is handled automatically by TFX, or does it have to be handled manually, as in a local data pipeline built with tf.data.Datasets?
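
For context, what I've done locally is roughly the following (the file pattern and parse_fn are just placeholders, not code from the course):

import tensorflow as tf

files = tf.data.Dataset.list_files("train-*.tfrecord")
dataset = (tf.data.TFRecordDataset(files)
           .map(parse_fn, num_parallel_calls=tf.data.AUTOTUNE)  # parse_fn: placeholder tf.train.Example parser, run in parallel
           .shuffle(10_000)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))  # prepare the next batch while the GPU trains on the current one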

Thanks!

Hi Gage! Welcome to Discourse, and good question! The Transform component stores your files as TFRecords, and these are fed to the Tuner and Trainer components as one of their arguments. The _input_fn() function in the trainer.py of the Week 1 Lab 2 (TFX Tuner and Trainer) takes care of converting these into tf.data.Datasets before training starts.

def _input_fn(file_pattern,
              tf_transform_output,
              num_epochs=None,
              batch_size=32) -> tf.data.Dataset:
  '''Create batches of features and labels from TF Records

  Args:
    file_pattern - List of files or patterns of file paths containing Example records.
    tf_transform_output - transform output graph
    num_epochs - Integer specifying the number of times to read through the dataset. 
            If None, cycles through the dataset forever.
    batch_size - An int representing the number of records to combine in a single batch.

  Returns:
    A dataset of dict elements, (or a tuple of dict elements and label). 
    Each dict maps feature keys to Tensor or SparseTensor objects.
  '''
  transformed_feature_spec = (
      tf_transform_output.transformed_feature_spec().copy())
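  # _gzip_reader_fn and LABEL_KEY below are defined elsewhere in the same
  # trainer module (not shown in this snippet).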
  
  dataset = tf.data.experimental.make_batched_features_dataset(
      file_pattern=file_pattern,
      batch_size=batch_size,
      features=transformed_feature_spec,
      reader=_gzip_reader_fn,
      num_epochs=num_epochs,
      label_key=LABEL_KEY)
  
  return dataset

This helper function is called in the run_fn() of the same module to create your train and eval sets.

  # Create batches of data good for 10 epochs
  train_set = _input_fn(fn_args.train_files[0], tf_transform_output, 10)
  val_set = _input_fn(fn_args.eval_files[0], tf_transform_output, 10)
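
One note on your original question: as far as I can tell, make_batched_features_dataset already handles shuffling, batching, and prefetching for you (its prefetch buffer size auto-tunes by default), so a lot of the ETL optimization happens out of the box. That said, what _input_fn() returns is a regular tf.data.Dataset, so inside run_fn() you can still chain your own transformations onto it before passing it to model.fit(). A minimal sketch, continuing from the two lines above (the explicit prefetch() calls are my addition, not part of the lab code):

  # train_set / val_set are ordinary tf.data.Dataset objects, so the usual
  # methods compose with them. Prefetching lets the CPU prepare the next
  # batch while the accelerator is busy with the current one.
  train_set = train_set.prefetch(tf.data.AUTOTUNE)  # tf.data.experimental.AUTOTUNE on older TF versions
  val_set = val_set.prefetch(tf.data.AUTOTUNE)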

Hope this helps! Will take note of this so we can revise the markdown for clarity. Thanks!

Thanks, I’m going to try it out this week. I’ll post any insights I have!