C2W2 Technical Question

Hi, there is this piece of text in the assignment here:

2.1 - ExampleGen

The pipeline starts with the ExampleGen component. It will:

  • split the data into training and evaluation sets (by default: 2/3 train, 1/3 eval).
  • convert each data row into tf.train.Example format. This protocol buffer is designed for TensorFlow operations and is used by the TFX components.
  • compress and save the data collection under the _pipeline_root directory for other components to access. These examples are stored in TFRecord format. This optimizes read and write operations within TensorFlow, especially if you have a large collection of data.

My question is: Does ExampleGen output two versions of the data?
I ask because the tf.train.Example and TFRecord formats are considered different in the documentation.

So the full set of tf.train.Example records is written to disk in one location, and the same records are written again in TFRecord format in another location. Is this understanding correct? If so, why is that optimal? Thank you for helping me learn.

There’s only one type of file stored on disk.

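TFRecord is the storage format as far as TFX is concerned. tf.train.Example is the serialization format for each individual record: every data row becomes a tf.train.Example message, which is serialized to bytes and written as one record inside a TFRecord file. So the two are layers of the same file, not two copies of the data. You can read about it here.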
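
Here is a minimal sketch of how the two layers fit together, assuming TensorFlow 2.x is installed. The feature names, values, and file path are made up for illustration. This is not ExampleGen's own code, just the same idea done by hand: build a tf.train.Example, serialize it, and write the bytes as one record of a gzip-compressed TFRecord file.

```python
import os
import tempfile

import tensorflow as tf

# One hypothetical data row (made-up feature names and values).
row = {"age": 39, "education": "Bachelors"}

# Layer 1: the row becomes a tf.train.Example protocol buffer message.
example = tf.train.Example(
    features=tf.train.Features(
        feature={
            "age": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[row["age"]])
            ),
            "education": tf.train.Feature(
                bytes_list=tf.train.BytesList(
                    value=[row["education"].encode("utf-8")]
                )
            ),
        }
    )
)

# The message is serialized to a plain byte string.
payload = example.SerializeToString()

# Layer 2: the byte string is written as one record of a TFRecord file.
# The path is a throwaway temp file, not a real pipeline location.
path = os.path.join(tempfile.mkdtemp(), "data_tfrecord.gz")
with tf.io.TFRecordWriter(path, options="GZIP") as writer:
    writer.write(payload)

# Reading goes the other way: TFRecord yields bytes, which are parsed
# back into tf.train.Example messages.
dataset = tf.data.TFRecordDataset(path, compression_type="GZIP")
decoded = [tf.train.Example.FromString(raw.numpy()) for raw in dataset]

print(len(decoded), "record(s) in", os.path.basename(path))
print(decoded[0])
```

Only one file ends up on disk. The TFRecord container does not care what the bytes are; tf.train.Example is simply the convention TFX components use for the payload.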

To check this, do the following:

  1. Create a copy of the data folder that contains just 1 record (say data2/).
  2. Run tfx.components.CsvExampleGen on that folder to generate the records.
  3. Inspect the generated records with the get_records function for both Split-train (1 record) and Split-eval (0 records).

No other TFRecord artifact is generated.
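
If you want to confirm this without the notebook helper, the check can be sketched with the standard library alone. This assumes the split files are gzip-compressed TFRecord files ending in .gz, and it relies on the TFRecord layout described in the TensorFlow documentation (an 8-byte length, a 4-byte checksum, the payload, and another 4-byte checksum). The examples_uri path is hypothetical, and the checksums are skipped rather than verified.

```python
import glob
import gzip
import os
import struct


def iter_tfrecord_payloads(path):
    """Yield the raw payload bytes of each record in a gzipped TFRecord file."""
    with gzip.open(path, "rb") as f:
        while True:
            header = f.read(8)  # uint64, little-endian: payload length
            if len(header) < 8:
                return  # end of file
            (length,) = struct.unpack("<Q", header)
            f.read(4)  # checksum of the length (not verified in this sketch)
            payload = f.read(length)  # one serialized tf.train.Example
            f.read(4)  # checksum of the payload (not verified in this sketch)
            yield payload


def count_records(split_dir):
    """Map each .gz file in a split directory to its number of records."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(split_dir, "*.gz"))):
        counts[os.path.basename(path)] = sum(
            1 for _ in iter_tfrecord_payloads(path)
        )
    return counts


# Hypothetical location: substitute the URI of your own ExampleGen output.
examples_uri = "pipeline_root/CsvExampleGen/examples/1"
for split in ("Split-train", "Split-eval"):
    print(split, count_records(os.path.join(examples_uri, split)))
```

Each split directory should list only its TFRecord file(s), and every payload is the serialized bytes of a single tf.train.Example. Because each record is prefixed with its length, a reader can stream through a large file sequentially without loading or parsing all of it at once.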

You might find the steps to edit metadata useful for making certain cells editable.

Thanks, Balaji, for the assistance. I’m mostly trying to understand the “why”, especially because the paragraph in the assignment says this is the optimal way. Based on your helpful link, I guess I need to understand “protocol buffers” and find an additional resource that shows the TFX architecture at a lower level.