C3_W1_Lab_2_TFX_Tuner_and_Trainer

Hello Learners,

I am quite confused by this lab. What confuses me is why the lab creates three datasets just to work with one dataset, fashion_mnist.

To make it clear what I mean, I will provide the examples here.

First, we load the fashion_mnist dataset.

# Download the dataset

ds, ds_info = tfds.load('fashion_mnist', data_dir=tempdir, with_info=True)

Then, a second time, we ingest the dataset into the TFX pipeline. At this point I am very confused as to why we are not using the tf.data.Dataset object that was just created above. Instead, we ingest TFRecord files built from the already downloaded images. Is it because TFX cannot deal with tf.data.Dataset? If so, how should this be handled in enterprise-level projects where the dataset consists of plain .jpg images and converting them to TFRecord would be too time-consuming, e.g. a 1 TB image dataset?


# Ingest the data through ExampleGen

example_gen = ImportExampleGen(input_base=_data_root, output_config=output)

Then again we are creating a third dataset just for KerasTuner as shown below.


  # Load the dataset. Specify the compression type since it is saved as `.gz`
  return tf.data.TFRecordDataset(filenames, compression_type='GZIP')

I am just confused about why we can’t use the dataset object ds from the first example.
In an actual enterprise-level project, I am not sure this example would make sense.
Could someone kindly shed some light on this? It would help me a lot.

Thanks in advance. Happy Learning!

Hi @pratsbhatt, let me share my understanding of this topic.

tfds.load does a few things: it downloads the data, prepares it, and returns a tf.data.Dataset object. tf.data.Dataset provides an API that lets you work with a large dataset without loading all of it into memory, but the underlying data is stored on disk in a format such as TFRecord or CSV.

So the first time we download the data, we get some TFRecord files.

For data ingestion in a TFX pipeline, it currently accepts data in formats such as CSV, TFRecord, and BigQuery, as described here. So the answer to your question is that it does not take a tf.data.Dataset directly, but it can ingest the underlying TFRecord files. That’s why we use ImportExampleGen and specify the path to the TFRecord files.

I think we need to pre-process raw image data anyway and then store the processed images in some format. TFRecord is a recommended format for that, since it provides good performance for loading, streaming, and processing. You only need to convert to TFRecord once.
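To make the "convert once" point a bit more concrete, here is a minimal sketch in plain Python (standard library only, no TensorFlow) of the TFRecord on-disk framing: each record is a little-endian uint64 length, a masked CRC32C of that length, the payload bytes, and a masked CRC32C of the payload. The helper names (crc32c, masked_crc, write_tfrecord, read_tfrecord) are hypothetical and only for illustration. The payloads are arbitrary byte strings standing in for encoded images; in a real TFX pipeline each record would hold a serialized tf.train.Example instead, which this sketch does not build.

```python
import struct
import tempfile
from pathlib import Path


def _make_crc32c_table():
    # Lookup table for CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc(data: bytes) -> int:
    # TFRecord stores a rotated CRC plus a constant, not the raw CRC.
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def write_tfrecord(path, payloads):
    """Write each bytes payload as one framed record."""
    with open(path, "wb") as f:
        for payload in payloads:
            header = struct.pack("<Q", len(payload))
            f.write(header)
            f.write(struct.pack("<I", masked_crc(header)))
            f.write(payload)
            f.write(struct.pack("<I", masked_crc(payload)))


def read_tfrecord(path):
    """Lazily yield payloads one at a time, verifying both checksums."""
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                return
            (length,) = struct.unpack("<Q", header)
            (header_crc,) = struct.unpack("<I", f.read(4))
            if header_crc != masked_crc(header):
                raise ValueError("corrupt length header")
            payload = f.read(length)
            (payload_crc,) = struct.unpack("<I", f.read(4))
            if payload_crc != masked_crc(payload):
                raise ValueError("corrupt payload")
            yield payload


def demo():
    # Fake "images": in practice these would be the bytes of .jpg files.
    fake_images = [b"jpeg-bytes-%d" % i for i in range(3)]
    with tempfile.TemporaryDirectory() as tmp:
        out_path = Path(tmp) / "images.tfrecord"
        write_tfrecord(out_path, fake_images)  # one-time conversion
        for payload in read_tfrecord(out_path):  # streamed back lazily
            print(len(payload), "bytes")


if __name__ == "__main__":
    demo()
```

The conversion is a single sequential pass over the source files, and the reader streams records one at a time, so neither step needs the whole dataset in memory. For large datasets the same idea is usually applied per shard, with many output files written in parallel.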

The third time, it actually returns a Dataset built from the same underlying TFRecord files.

In summary, let’s separate the concept of the data source (where the data is actually stored: local files, cloud storage) from the API that fetches, prefetches, and buffers that data for training and inference. Instead of passing the tf.data.Dataset object around, the lab passes around the information about the data source (the path to the TFRecord files), and each component uses it to read the data.
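The separation can be sketched roughly as follows. This is not TFX or tf.data code: it uses a gzipped text file as a stand-in for the TFRecord files, and the names write_records, open_dataset, tuner_component, and trainer_component are hypothetical. The point is only that the components share a path, and each one opens its own lazy reader over the same stored data.

```python
import gzip
import itertools
import tempfile
from pathlib import Path


def write_records(path, records):
    """Store the data once. The file on disk is the data source."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(record + "\n")


def open_dataset(path):
    """Return a fresh, lazy iterator over the records stored at `path`."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")


def tuner_component(path, max_records):
    # A tuner might only look at a small subset of the data.
    return list(itertools.islice(open_dataset(path), max_records))


def trainer_component(path):
    # A trainer streams through everything; here it just counts records.
    return sum(1 for _ in open_dataset(path))


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        data_path = Path(tmp) / "records.gz"
        write_records(data_path, [f"rec-{i}" for i in range(5)])
        # Only the path is handed to each component.
        print(tuner_component(data_path, max_records=2))
        print(trainer_component(data_path))


if __name__ == "__main__":
    demo()
```

Because only the path is shared, the components can run in separate processes or on separate machines, and each can apply its own batching or shuffling when it builds its reader.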

Hope it helps, and happy learning!
Cuong

Hi Prateek! In addition to what Cuong said, I think there’s just a misunderstanding about the setup of the lab, and we might need to revise the markdown for clarity. All you’re doing in the first part of the lab is getting a TFRecord into your workspace. Colab doesn’t have one by default, so we copy some from TFDS so that you have data to work with in the next sections of the lab. You can disregard the fact that tfds.load() also returns a tf.data.Dataset. That’s just incidental. We could have dropped TFDS and directly copied a Fashion MNIST TFRecord from some other place (just like the Course 2 Week 4 Ungraded Lab 3 on CIFAR10, where no tf.data.Dataset is generated), and the TFX pipeline would stay the same. What you’re after is the TFRecord copied locally to your workspace, so you can simulate a production environment where you have TFRecords that contain your raw data. Hope this also helps!

Thank you @tranvinhcuong and @chris.favila for your answers. I am sorry to get back to this post this late.

I understand both of your explanations. What is hard for me to understand is this: the course is about MLOps in production, which in a real-world scenario probably means dealing with ~100 GB of data or more, and in my personal experience images are usually stored in formats other than TFRecord.

If one needs to pre-process such a huge amount of data (convert it to TFRecord format) just to use a TFX pipeline, it feels somewhat unrealistic.

Another aspect is that companies often use more than one library, such as PyTorch. I do not have much experience with PyTorch, but I suspect that if we have converted all the raw images to TFRecord, we won’t be able to use PyTorch with them.

I hope you understand where I am coming from. In my humble opinion, a lab showing how easy or difficult a real-world example would be would be beneficial to the students.

I am sorry if I have misunderstood anything. Feel free to correct me.

Thank you once again for your help and support.

Warm regards,
Prateek Bhatt