C3_W2_Lab_1 Questions

I have some questions on the code of C3_W2_Lab_1.

  1. Why specify “pattern” as ‘/tmp/data/taxi-train*’ instead of passing a single CSV file?

dataset = tf.data.experimental.make_csv_dataset(pattern, batch_size)

  2. Additionally, what does the following snippet of code do:

dataset = dataset.shuffle(1000).repeat()

Does it create a buffer of size 1000 and randomly pick batch_size (which we set to 32) elements each time? The comment also says the dataset will loop infinitely, but where does the looping occur?

  3. In the features_and_labels method, is “row_data” an OrderedDict of TF tensors?

  4. Going back to my question 2, I am a little confused about where the 1000 came from. It seems that:
    steps_per_epoch = NUM_TRAIN_EXAMPLES // TRAIN_BATCH_SIZE = 59620 // 32 = 1863

  1. The pattern matches every CSV shard, so all of them get picked up. Large CSV files are split into multiple shards for the following reasons:
    a. For large datasets, read operations can happen in parallel across shards.
    b. For filesystems that impose a size limit on a single file, splitting the data makes it possible to store the large dataset.
  2. Please see this link to learn about shuffle.
  3. Use Dataset#take to explore the tuple returned by the map function.
  4. 1000 refers to the shuffle buffer size, not to steps_per_epoch; the two numbers are unrelated. The buffer size is a hyperparameter worth tuning based on the size of a single data point and the available RAM.
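To make the shuffle answer concrete, here is a rough pure-Python sketch of what a shuffle buffer does (the function name shuffled and its details are my own illustration, not the actual tf.data implementation): fill a buffer of buffer_size elements, yield a randomly chosen one, and refill that slot from the incoming stream.

```python
import random

def shuffled(stream, buffer_size, seed=None):
    # Rough model of Dataset.shuffle(buffer_size): keep a buffer of
    # buffer_size elements, yield a random one, refill from the stream.
    rng = random.Random(seed)
    it = iter(stream)
    buffer = []
    # Fill the buffer first.
    for item in it:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    # Yield a random buffered element, replacing it with the next input.
    for item in it:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # Drain whatever is left in the buffer, in random order.
    rng.shuffle(buffer)
    yield from buffer

out = list(shuffled(range(10), buffer_size=4, seed=0))
print(sorted(out) == list(range(10)))  # every element appears exactly once
```

Note that an element can only move roughly buffer_size positions from where it started, which is why a buffer much smaller than the dataset gives only a weak shuffle; batching then draws batch_size elements from this already-shuffled stream.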
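On question 4: because repeat() with no argument loops the dataset indefinitely, the epoch boundary has to be imposed from outside, which is what steps_per_epoch in model.fit does. A minimal check of the arithmetic from the question (using the constants quoted there):

```python
# repeat() makes the dataset infinite, so model.fit needs steps_per_epoch
# to know when one pass over the training data is "done".
NUM_TRAIN_EXAMPLES = 59620
TRAIN_BATCH_SIZE = 32

steps_per_epoch = NUM_TRAIN_EXAMPLES // TRAIN_BATCH_SIZE
print(steps_per_epoch)  # 1863
```

The 1000 in shuffle(1000) never enters this computation; it only controls how many elements sit in the shuffle buffer at any one time.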