Hi, there is this piece of text in the assignment here:
2.1 - ExampleGen
The pipeline starts with the ExampleGen component. It will:
- split the data into training and evaluation sets (by default: 2/3 train, 1/3 eval).
- convert each data row into tf.train.Example format. This protocol buffer is designed for Tensorflow operations and is used by the TFX components.
- compress and save the data collection under the _pipeline_root directory for other components to access. These examples are stored in TFRecord format. This optimizes read and write operations within Tensorflow especially if you have a large collection of data.
My question is: Does ExampleGen output two versions of the data?
I ask because the tf.train.Example and TFRecord formats are considered different in the documentation.
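For context, here is how I read the TFRecord layout in the TensorFlow docs: a file is a sequence of records, each one a length, a checksum of the length, the raw bytes, and a checksum of the bytes. Below is a rough sketch of that framing in plain Python. It is only my own illustration, not how ExampleGen actually does it. The names `frame_record` and `read_records` and the placeholder payloads are made up; in real code each payload would be the bytes returned by `example.SerializeToString()` on a tf.train.Example.

```python
import struct

# CRC-32C (Castagnoli) lookup table, reflected polynomial 0x82F63B78.
_CRC_TABLE = []
for i in range(256):
    crc = i
    for _ in range(8):
        crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
    _CRC_TABLE.append(crc)


def crc32c(data: bytes) -> int:
    """Plain CRC-32C of a byte string."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc(data: bytes) -> int:
    """CRC-32C with the rotate-and-add masking used in TFRecord files."""
    crc = crc32c(data)
    return (((crc >> 15) | (crc << 17)) + 0xA282EAD8) & 0xFFFFFFFF


def frame_record(payload: bytes) -> bytes:
    """Wrap one payload as: length, CRC of length, data, CRC of data."""
    length = struct.pack("<Q", len(payload))  # 8-byte little-endian length
    return (length + struct.pack("<I", masked_crc(length))
            + payload + struct.pack("<I", masked_crc(payload)))


def read_records(blob: bytes):
    """Yield the payloads back out (checksums are skipped, not verified)."""
    offset = 0
    while offset < len(blob):
        (length,) = struct.unpack_from("<Q", blob, offset)
        offset += 12  # 8-byte length + 4-byte length CRC
        yield blob[offset:offset + length]
        offset += length + 4  # payload + 4-byte payload CRC


# Placeholder bytes standing in for two serialized tf.train.Example messages.
payloads = [b"example-one", b"example-two"]
blob = b"".join(frame_record(p) for p in payloads)
recovered = list(read_records(blob))
```

If this picture is right, the TFRecord part is just the wrapper around each record, and the record contents are the serialized examples.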
So is the full set of tf.train.Example records written to disk in one location, and the same records written again in TFRecord format in another location? Is this the correct understanding? If so, why is it optimal? Thank you for helping me learn.