Create the Dataset and Split it into Training and Validation Sets

Hi,

I’m confused about the `validation_split` argument (set to 0.2) when defining `train_dataset`. I was expecting it to be 0.8, consistent with the output shown below: the dataset has 327 data points, and 80% of them (about 262 images) were assigned to the training set. The argument value of 0.2 is what confuses me.

I would be grateful if you could make the point clear to me.

Thanks

See the `subset` parameter; it determines whether the call returns the training or the validation split. Docs: tf.keras.utils.image_dataset_from_directory | TensorFlow Core v2.7.0
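
Here’s a minimal sketch of how the two calls pair up (the directory path, seed, and image size are placeholders, not necessarily the assignment’s values):

```python
import tensorflow as tf

# Hypothetical directory of class subfolders; adjust to your setup.
DATA_DIR = "images/"

train_dataset = tf.keras.utils.image_dataset_from_directory(
    DATA_DIR,
    validation_split=0.2,   # reserve 20% of the files for validation
    subset="training",      # this call returns the remaining 80%
    seed=42,                # same seed in both calls keeps the splits disjoint
    image_size=(160, 160),
    batch_size=32,
)

validation_dataset = tf.keras.utils.image_dataset_from_directory(
    DATA_DIR,
    validation_split=0.2,   # same fraction in both calls
    subset="validation",    # this call returns the reserved 20%
    seed=42,
    image_size=(160, 160),
    batch_size=32,
)
```

So `validation_split=0.2` always names the fraction held out for validation; `subset` decides which side of that split a given call hands back.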

Also, I wanted to check the number of data points after successfully implementing the data_augmentation code, but I realized that with a PrefetchDataset this is a bit tricky.
Can you help me figure out the new size of train_dataset after augmentation?


Yes, the `subset` param is clear. But when specifying `subset="training"`, I was expecting the percentage to be consistent with the output (80 percent); that’s why I’m confused.

Since we’re specifying `validation_split=0.2`, the train split = 1 - 0.2 = 0.8.
Train dataset size = round(327 * 0.8) = round(261.6) = 262

Validation dataset size = round(327 * 0.2) = round(65.4) = 65
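
If you want to double-check those numbers, a quick sanity check like this works (assuming `train_dataset` and `validation_dataset` are the batched datasets returned by `image_dataset_from_directory`):

```python
# Count images by summing the batch sizes in each split.
n_train = sum(images.shape[0] for images, labels in train_dataset)
n_val = sum(images.shape[0] for images, labels in validation_dataset)
print(n_train, n_val)  # expected: 262 65
```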


Got it, thanks!
Do you have an answer to my second question (about data augmentation), please?

Not sure if I understand your question correctly. If you want to find the size of the dataset after image augmentation, say, for the training dataset, it’s ceil(262 / 32) = 9 batches, since 32 is the batch size. To elaborate, there will be 8 batches of 32 images each and 1 final batch of 6 images, so the last batch has fewer than 32.
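
You can confirm the batch count directly (a quick sketch, assuming the `train_dataset` from above):

```python
import tensorflow as tf

# Cardinality counts batches, not images: ceil(262 / 32) = 9.
print(tf.data.experimental.cardinality(train_dataset).numpy())  # 9
print(len(train_dataset))  # also 9, since the cardinality is known
```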

I meant: we had 262 images for training (the train_dataset size), but we wanted to train the model on a larger dataset, which is why we tried the data augmentation strategy. So I want to know the new number of training data points after data augmentation has been applied.

Image augmentation happens on the fly, just before a batch of data is used for training, by applying the random transformations you configure.

So, to answer your question: the effective size of the dataset is increased by applying these small random changes to the images across batches. You’re still training on the same 9 batches per epoch, but their contents change over time.
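
For a concrete picture, here’s a minimal sketch of on-the-fly augmentation (the layer choices and parameters are illustrative, not the assignment’s exact ones):

```python
import tensorflow as tf

# Random transformations, re-drawn every time a batch passes through.
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.2),
])

# The batch count stays at 9, but the pixel values of each batch
# differ from epoch to epoch because the transformations are random.
augmented_train_dataset = train_dataset.map(
    lambda images, labels: (data_augmentation(images, training=True), labels)
)
```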

So, in the end, how can I know how much data the model has been trained on (counting both original and augmented data points)?

All you know is that for each epoch, your model trains on 262 images. The exact number of unique images across all epochs is unknown, since random changes are introduced each time.
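For example, over 10 epochs the model processes 10 * 262 = 2,620 augmented views, even though there are only 262 distinct source images.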

Thank you, @balaji.ambresh !

You’re welcome @Bertrand_T_Tameza