Shuffling the data may be a good idea, but why here?
Furthermore, BUFFER_SIZE is 10000 but the dataset contains 25000 items: since 25000 > 10000, train_dataset is shuffled through a buffer that holds only 10000 items at a time, so the shuffle is not uniform over the whole dataset. This is useful in some situations, but I don’t understand how it is useful here…
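For reference, here is a toy sketch of what I mean (using a small `tf.data.Dataset.range` stand-in, not the actual train_dataset):

```python
import tensorflow as tf

# Toy stand-in for the notebook's train_dataset: 25 elements, buffer of 5.
# shuffle() keeps a buffer of `buffer_size` elements and draws uniformly
# from it, so a buffer smaller than the dataset only mixes nearby elements.
ds = tf.data.Dataset.range(25)

print(list(ds.shuffle(buffer_size=5).as_numpy_iterator()))   # early items stay near the front
print(list(ds.shuffle(buffer_size=25).as_numpy_iterator()))  # buffer >= dataset size: uniform shuffle
```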
It’s always good to shuffle the underlying dataset so that the label ordering is randomized. If you train with a custom dataset that happens to be ordered by label, you can reuse the rest of the same notebook by changing just the data source.
You are right that a 10K buffer size is small for a 25K dataset, as indicated here as well. This was just an example of using the shuffle API.
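If you want a complete shuffle, a minimal sketch would be to size the buffer to the dataset itself (here `train_dataset` is a placeholder for the notebook's real dataset, and this assumes its cardinality is known):

```python
import tensorflow as tf

# Placeholder standing in for the notebook's real train_dataset.
train_dataset = tf.data.Dataset.range(25000)

# Size the buffer to the full dataset so every element can end up anywhere.
# Note: cardinality() may be UNKNOWN for e.g. generator-backed datasets,
# in which case you would hard-code the dataset size instead.
BUFFER_SIZE = int(train_dataset.cardinality())

train_dataset = train_dataset.shuffle(
    BUFFER_SIZE,
    reshuffle_each_iteration=True,  # reshuffle between epochs (the default)
)
```

The trade-off is memory: the buffer holds BUFFER_SIZE elements at once, which is why tutorials often pick a smaller value than the dataset size.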