Understanding Dataset processing

The following is the code given by the instructor:

dataset = tf.data.Dataset.from_tensor_slices(series)

# Window the data but only take those with the specified size
dataset = dataset.window(window_size + 1, shift=1, drop_remainder=True)

# Flatten the windows by putting its elements in a single batch
dataset = dataset.flat_map(lambda window: window.batch(window_size + 1))

# Create tuples with features and labels 
dataset = dataset.map(lambda window: (window[:-1], window[-1]))

# Shuffle the windows
dataset = dataset.shuffle(shuffle_buffer)

# Create batches of windows
dataset = dataset.batch(batch_size).prefetch(1)

return dataset
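To check my understanding, here is a plain-Python sketch of what I think each step does. This is only an analogue of the tf.data calls, not the real API, and the series and sizes are toy values I made up (shuffle and prefetch are left out since they don't change the shapes):

```python
series = list(range(100))
window_size, batch_size = 10, 32

# window(window_size + 1, shift=1, drop_remainder=True) followed by
# flat_map(... .batch(window_size + 1)): one row of window_size + 1
# consecutive values per window
windows = [series[i:i + window_size + 1] for i in range(len(series) - window_size)]

# map(lambda w: (w[:-1], w[-1])): the first window_size values become the
# features and the last value becomes the label
pairs = [(w[:-1], w[-1]) for w in windows]

# batch(batch_size): group the (features, label) pairs into batches
batches = [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]

print(len(windows), len(pairs[0][0]), pairs[0][1])  # 90 10 10
print(len(batches), len(batches[0]))                # 3 32
```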

The following is how I understand it, but I would like a comprehension check, because I feel like I am likely wrong:

  1. from_tensor_slices puts the array of data into a dataset object that TensorFlow understands.
  2. We then window the data. We add 1 to the window size to account for making the last element in each window the label. (At this point, I'm not sure what the structure of the data is. Assuming we had 100 datapoints and our window is 10, do we now have a 10 x 10 matrix?)
  3. We then put our windows into batch form and flatten them. Again, I'm not sure about the structure of the data. Would it now be 100 x 1?
  4. We now take the last element of each window and turn it into the label for that window.
  5. Shuffle them.
  6. Batch them again and prefetch for processing?

I guess one of my questions is that it seems like we are doing the same thing a bunch of times. Can someone explain the intricacies of what each of those steps does, and how far off I am?
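On the structure question in step 2: with 100 points, window_size=10, and shift=1, windowing with drop_remainder=True yields 100 - 10 = 90 overlapping windows of 11 elements each (10 features + 1 label), not a 10 x 10 matrix. And on step 3: window produces a dataset of sub-datasets, and flat_map(... .batch(...)) collapses each sub-dataset into a single tensor of 11 values, so you end up with 90 rows of length 11, not 100 x 1. A quick sanity check of the counts (sizes assumed for illustration):

```python
n_points, window_size = 100, 10

# shift=1 with drop_remainder=True keeps only full windows of window_size + 1
n_windows = n_points - window_size
print(n_windows)         # 90 windows...
print(window_size + 1)   # ...of 11 elements each (10 features + 1 label)
```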

A good way to understand the transformations better is to use Dataset#take.

You can do something like this:

def print_row(a_dataset, msg):
    print(f'Printing a row of data after {msg}')
    for row in a_dataset.take(1):
        print(row)
    print()

def windowed_dataset(series, window_size=G.WINDOW_SIZE, batch_size=G.BATCH_SIZE, shuffle_buffer=G.SHUFFLE_BUFFER_SIZE):
    ds = tf.data.Dataset.from_tensor_slices(series)
    print_row(ds, 'tf.data.Dataset.from_tensor_slices')
    ds = ds.window(window_size + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda w: w.batch(window_size + 1))
    print_row(ds, 'ds.flat_map')
    ds = ds.shuffle(shuffle_buffer)
    ds = ds.map(lambda w: (w[:-1], w[-1]))
    print_row(ds, 'ds.map')
    ds = ds.batch(batch_size).prefetch(1)
    print_row(ds, 'ds.batch')
    return ds


# Apply the transformation to the training set
train_set = windowed_dataset(series_train, window_size=G.WINDOW_SIZE, batch_size=G.BATCH_SIZE, shuffle_buffer=G.SHUFFLE_BUFFER_SIZE)

The output will look like this:

Printing a row of data after tf.data.Dataset.from_tensor_slices
tf.Tensor(20.7, shape=(), dtype=float64)

Printing a row of data after ds.flat_map
tf.Tensor(
[20.7 17.9 18.8 14.6 15.8 15.8 15.8 17.4 21.8 20.  16.2 13.3 16.7 21.5
 25.  20.7 20.6 24.8 17.7 15.5 18.2 12.1 14.4 16.  16.5 18.7 19.4 17.2
 15.5 15.1 15.4 15.3 18.8 21.9 19.9 16.6 16.8 14.6 17.1 25.  15.  13.7
 13.9 18.3 22.  22.1 21.2 18.4 16.6 16.1 15.7 16.6 16.5 14.4 14.4 18.5
 16.9 17.5 21.2 17.8 18.6 17.  16.  13.3 14.3], shape=(65,), dtype=float64)

Printing a row of data after ds.map
(<tf.Tensor: shape=(64,), dtype=float64, numpy=
array([11. , 11.1, 15. , 12.8, 15. , 14.2, 14. , 15.5, 13.3, 15.6, 15.2,
       17.4, 17. , 15. , 13.5, 15.2, 13. , 12.5, 14.1, 14.8, 16.2, 15.8,
       19.1, 22.2, 15.9, 13. , 14.1, 15.8, 24. , 18. , 19.7, 25.2, 20.5,
       19.3, 15.8, 17. , 18.4, 13.3, 14.6, 12.5, 17. , 17.1, 14. , 14.6,
       13.3, 14.8, 15.1, 13.1, 13.6, 19.5, 22.7, 17.2, 13.5, 15.4, 17. ,
       19.2, 22.8, 26.3, 18.2, 17. , 14.8, 12.8, 15.5, 15.6])>, <tf.Tensor: shape=(), dtype=float64, numpy=13.1>)

Printing a row of data after ds.batch
(<tf.Tensor: shape=(32, 64), dtype=float64, numpy=
array([[16.9, 14.7, 10.6, ..., 10. , 11.4, 12.6],
       [14. , 12.5, 11.5, ...,  4.5,  5.7,  5.6],
       [14.3,  8.3,  5.3, ..., 11.8, 10.6, 10. ],
       ...,
       [ 2.5,  5.3,  6.6, ..., 13.6,  8.3,  8.5],
       [11.5, 13.8, 13.3, ..., 17.6, 15.5, 16.7],
       [ 8.3,  5.3,  3. , ..., 10.6, 10. , 12.2]])>, <tf.Tensor: shape=(32,), dtype=float64, numpy=
array([10.7,  7.1, 12.2,  3.5, 15. , 11.9, 17. ,  3.3, 17.4, 12.5, 13.2,
        9.4, 11. , 10.6,  9.9, 12.4,  6.3, 11.3, 12.7, 10.6,  3.1, 15.6,
        7.1, 14. , 17.5, 12.1, 13. ,  6.9, 13. , 12.9, 16.3,  8.9])>)

That's a great resource, thanks!