C2W2 Technical Question

gerald_wrona · February 7, 2024, 11:40pm

Hi, there is this piece of text in the assignment here:

2.1 - ExampleGen

The pipeline starts with the ExampleGen component. It will:

split the data into training and evaluation sets (by default: 2/3 train, 1/3 eval).
convert each data row into tf.train.Example format. This protocol buffer is designed for TensorFlow operations and is used by the TFX components.
compress and save the data collection under the _pipeline_root directory for other components to access. These examples are stored in TFRecord format. This optimizes read and write operations within TensorFlow, especially if you have a large collection of data.

My question is: Does ExampleGen output two versions of the data?
I ask because the tf.train.Example and TFRecord formats are considered different in the documentation.

So the full set of tf.train.Example records is written to disk in one location, and the same records are written in TFRecord format in another location. Is this the correct understanding? Why is it optimal? Thank you for helping me learn.

balaji.ambresh · February 8, 2024, 4:55am

There’s only 1 type of file stored on disk.

TFRecord format is the storage layer as far as TFX is concerned. tf.train.Example is the serialization mechanism for each individual record (a protocol buffer message), and the serialized bytes are what get stored inside the underlying TFRecord file. You can read about it here.
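As a minimal sketch of that layering (not the code ExampleGen itself runs), the snippet below builds one tf.train.Example, serializes it to bytes, writes those bytes into a TFRecord file, and reads them back. The feature names, values, and file name are made up for illustration.

```python
import os
import tempfile

import tensorflow as tf

# Build one tf.train.Example: an in-memory protocol buffer message.
# The feature names and values are hypothetical.
example = tf.train.Example(
    features=tf.train.Features(
        feature={
            "age": tf.train.Feature(int64_list=tf.train.Int64List(value=[39])),
            "education": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[b"Bachelors"])
            ),
        }
    )
)

# Serialization turns the message into plain bytes.
payload = example.SerializeToString()

# A TFRecord file is a sequence of byte strings; it does not care what
# the bytes mean. Here the single record is the serialized Example.
path = os.path.join(tempfile.mkdtemp(), "demo.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    writer.write(payload)

# Reading the file back yields the same bytes, which parse into the
# same message.
raw_records = [record.numpy() for record in tf.data.TFRecordDataset(path)]
restored = tf.train.Example.FromString(raw_records[0])
```

There is one file on disk here. tf.train.Example describes what each record contains, and TFRecord describes how the records are laid out in the file.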

To check this, do the following:

Create a duplicate of the data folder with 1 record (say data2/).
Run tfx.components.CsvExampleGen on that folder to generate the records.
Look at the generated records using the get_records function for both Split-train (1 record) and Split-eval (0 records).

There’ll be no other TFRecord artifact generated.
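If the notebook's get_records helper is not at hand, the record counts can also be checked with a rough standard-library sketch like the one below. It assumes the split directories contain gzip-compressed TFRecord files and nothing else; count_tfrecords and count_split are hypothetical helper names, and the checksums in each record are skipped rather than verified.

```python
import gzip
import os
import struct


def count_tfrecords(path):
    """Count records in one gzip-compressed TFRecord file.

    Each record is framed as: 8-byte little-endian length, 4-byte CRC of
    the length, the payload bytes, 4-byte CRC of the payload. The CRCs
    are skipped here, not verified.
    """
    count = 0
    with gzip.open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break  # end of file
            (length,) = struct.unpack("<Q", header)
            f.read(4)       # CRC of the length field (not checked)
            f.read(length)  # payload: a serialized tf.train.Example in TFX output
            f.read(4)       # CRC of the payload (not checked)
            count += 1
    return count


def count_split(split_dir):
    """Sum the record counts over every file in a split directory."""
    return sum(
        count_tfrecords(os.path.join(split_dir, name))
        for name in sorted(os.listdir(split_dir))
    )
```

Calling count_split on the Split-train and Split-eval directories of the ExampleGen output artifact should give 1 and 0 for the single-record experiment described above. Each payload it skips over is one serialized tf.train.Example, so there is no second copy of the data in a different format.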

You might find steps to edit metadata useful to make certain cells editable.

gerald_wrona · February 8, 2024, 12:38pm

Thanks Balaji for the assistance. I’m mainly trying to understand the “why”, especially because the paragraph in the assignment says it is the optimal way. Based on your helpful link, I guess I need to understand “protocol buffers” and find an additional resource that shows the TFX architecture at a lower level.
