Understanding Dataset processing

The following is the code given by the instructor:

dataset = tf.data.Dataset.from_tensor_slices(series)

# Window the data but only take those with the specified size
dataset = dataset.window(window_size + 1, shift=1, drop_remainder=True)

# Flatten the windows by putting its elements in a single batch
dataset = dataset.flat_map(lambda window: window.batch(window_size + 1))

# Create tuples with features and labels 
dataset = dataset.map(lambda window: (window[:-1], window[-1]))

# Shuffle the windows
dataset = dataset.shuffle(shuffle_buffer)

# Create batches of windows
dataset = dataset.batch(batch_size).prefetch(1)

return dataset
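To check my understanding, here is a plain-Python sketch of what I think each step does. This is only an analogue of the tf.data calls, not the real API, and the series and sizes are toy values I made up (shuffle and prefetch are left out since they don't change the shapes):

```python
series = list(range(100))
window_size, batch_size = 10, 32

# window(window_size + 1, shift=1, drop_remainder=True) followed by
# flat_map(... .batch(window_size + 1)): one row of window_size + 1
# consecutive values per window
windows = [series[i:i + window_size + 1] for i in range(len(series) - window_size)]

# map(lambda w: (w[:-1], w[-1])): the first window_size values become the
# features and the last value becomes the label
pairs = [(w[:-1], w[-1]) for w in windows]

# batch(batch_size): group the (features, label) pairs into batches
batches = [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]

print(len(windows), len(pairs[0][0]), pairs[0][1])  # 90 10 10
print(len(batches), len(batches[0]))                # 3 32
```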

The following is how I understand it, but I would like a comprehension check, because I feel like I am likely wrong:

  1. from_tensor_slices puts the array of data into a dataset object that TensorFlow understands.
  2. We then window the data. We add 1 to the window size to account for making the last element in each window the label. (At this point, I'm not sure what the structure of the data is. Assuming we had 100 datapoints and our window is 10, do we now have a 10 x 10 matrix?)
  3. We then put our windows into batch form and flatten them. Again, I'm not sure about the structure of the data. Would it now be 100 x 1?
  4. We now take the last element of each window and turn it into the label for that window.
  5. Shuffle them.
  6. Batch them again and prefetch for processing?

I guess one of my questions is that it seems like we are doing the same thing a bunch of times. Can someone explain the intricacies of what each of those steps does, and how far off I am?
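On the structure question in step 2: with 100 points, window_size=10, and shift=1, windowing with drop_remainder=True yields 100 - 10 = 90 overlapping windows of 11 elements each (10 features + 1 label), not a 10 x 10 matrix. And on step 3: window produces a dataset of sub-datasets, and flat_map(... .batch(...)) collapses each sub-dataset into a single tensor of 11 values, so you end up with 90 rows of length 11, not 100 x 1. A quick sanity check of the counts (sizes assumed for illustration):

```python
n_points, window_size = 100, 10

# shift=1 with drop_remainder=True keeps only full windows of window_size + 1
n_windows = n_points - window_size
print(n_windows)         # 90 windows...
print(window_size + 1)   # ...of 11 elements each (10 features + 1 label)
```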

A good way to understand the transformations better is to use Dataset#take.

You can do something like this:

def print_row(a_dataset, msg):
    print(f'Printing a row of data after {msg}')
    for row in a_dataset.take(1):
        print(row)
    print()

def windowed_dataset(series, window_size=G.WINDOW_SIZE, batch_size=G.BATCH_SIZE, shuffle_buffer=G.SHUFFLE_BUFFER_SIZE):
    ds = tf.data.Dataset.from_tensor_slices(series)
    print_row(ds, 'tf.data.Dataset.from_tensor_slices')
    ds = ds.window(window_size + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda w: w.batch(window_size + 1))
    print_row(ds, 'ds.flat_map')
    ds = ds.shuffle(shuffle_buffer)
    ds = ds.map(lambda w: (w[:-1], w[-1]))
    print_row(ds, 'ds.map')
    ds = ds.batch(batch_size).prefetch(1)
    print_row(ds, 'ds.batch')
    return ds


# Apply the transformation to the training set
train_set = windowed_dataset(series_train, window_size=G.WINDOW_SIZE, batch_size=G.BATCH_SIZE, shuffle_buffer=G.SHUFFLE_BUFFER_SIZE)

The output will look like this:

Printing a row of data after tf.data.Dataset.from_tensor_slices
tf.Tensor(20.7, shape=(), dtype=float64)

Printing a row of data after ds.flat_map
tf.Tensor(
[20.7 17.9 18.8 14.6 15.8 15.8 15.8 17.4 21.8 20.  16.2 13.3 16.7 21.5
 25.  20.7 20.6 24.8 17.7 15.5 18.2 12.1 14.4 16.  16.5 18.7 19.4 17.2
 15.5 15.1 15.4 15.3 18.8 21.9 19.9 16.6 16.8 14.6 17.1 25.  15.  13.7
 13.9 18.3 22.  22.1 21.2 18.4 16.6 16.1 15.7 16.6 16.5 14.4 14.4 18.5
 16.9 17.5 21.2 17.8 18.6 17.  16.  13.3 14.3], shape=(65,), dtype=float64)

Printing a row of data after ds.map
(<tf.Tensor: shape=(64,), dtype=float64, numpy=
array([11. , 11.1, 15. , 12.8, 15. , 14.2, 14. , 15.5, 13.3, 15.6, 15.2,
       17.4, 17. , 15. , 13.5, 15.2, 13. , 12.5, 14.1, 14.8, 16.2, 15.8,
       19.1, 22.2, 15.9, 13. , 14.1, 15.8, 24. , 18. , 19.7, 25.2, 20.5,
       19.3, 15.8, 17. , 18.4, 13.3, 14.6, 12.5, 17. , 17.1, 14. , 14.6,
       13.3, 14.8, 15.1, 13.1, 13.6, 19.5, 22.7, 17.2, 13.5, 15.4, 17. ,
       19.2, 22.8, 26.3, 18.2, 17. , 14.8, 12.8, 15.5, 15.6])>, <tf.Tensor: shape=(), dtype=float64, numpy=13.1>)

Printing a row of data after ds.batch
(<tf.Tensor: shape=(32, 64), dtype=float64, numpy=
array([[16.9, 14.7, 10.6, ..., 10. , 11.4, 12.6],
       [14. , 12.5, 11.5, ...,  4.5,  5.7,  5.6],
       [14.3,  8.3,  5.3, ..., 11.8, 10.6, 10. ],
       ...,
       [ 2.5,  5.3,  6.6, ..., 13.6,  8.3,  8.5],
       [11.5, 13.8, 13.3, ..., 17.6, 15.5, 16.7],
       [ 8.3,  5.3,  3. , ..., 10.6, 10. , 12.2]])>, <tf.Tensor: shape=(32,), dtype=float64, numpy=
array([10.7,  7.1, 12.2,  3.5, 15. , 11.9, 17. ,  3.3, 17.4, 12.5, 13.2,
        9.4, 11. , 10.6,  9.9, 12.4,  6.3, 11.3, 12.7, 10.6,  3.1, 15.6,
        7.1, 14. , 17.5, 12.1, 13. ,  6.9, 13. , 12.9, 16.3,  8.9])>)

That's a great resource, thanks!