Create the Dataset and Split it into Training and Validation Sets


I’m confused about the validation_split argument (set to 0.2) when defining train_dataset. I was expecting this value to be 0.8, consistent with the output shown below: we have 327 data points in the dataset, and 80% of them (about 262 images) have been assigned to the training set. The argument value of 0.2 is what confuses me.

I would be grateful if you could make the point clear to me.


Take a look at the subset parameter. That’s what decides whether you get the training or the validation split. See tf.keras.utils.image_dataset_from_directory  |  TensorFlow Core v2.7.0

Also, I wanted to check the number of data points after successfully implementing the data_augmentation code. I realized that with a PrefetchDataset this is a bit tricky.
Could you help me figure out the new size of train_dataset after augmentation?


Yes, the subset param makes that clear. But when specifying subset="training", I was expecting the percentage to also be consistent with the output (80 percent); this is why I’m confused.

Since we’re specifying validation_split as 0.2, the train split = 1 - 0.2 = 0.8.
Train dataset size = round(327 * 0.8) = round(261.6) = 262

Validation dataset size = round(327 * 0.2) = round(65.4) = 65
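
The arithmetic above can be sketched in plain Python. This is only the size calculation, not the actual Keras call; the rounding via round() is an assumption to match the numbers in this thread:

```python
# Split arithmetic for 327 images with validation_split=0.2,
# as discussed above.
total_images = 327
validation_split = 0.2  # the value passed to image_dataset_from_directory

# subset="validation" receives the validation_split fraction;
# subset="training" receives the remainder (1 - validation_split).
val_size = round(total_images * validation_split)          # 65
train_size = round(total_images * (1 - validation_split))  # 262

print(train_size, val_size)  # 262 65
```

So validation_split always names the fraction held out for validation; the training fraction is implied as its complement.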


Got it, thanks!
Do you have an answer to my second concern please? (about data augmentation)

Not sure if I understand your question correctly. If you want to find the size of the dataset after image augmentation, say, for the training dataset, it’s ceil(262 / 32) = 9 batches, since 32 is the batch size. To elaborate, there will be 8 batches of 32 images each and 1 batch of 6 images.
The last batch in this case will have fewer than 32 images.
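
A quick sketch of that batch arithmetic, using the 262 training images and the batch size of 32 from this thread:

```python
import math

# Batch arithmetic for 262 training images with batch_size=32.
train_size = 262
batch_size = 32

num_batches = math.ceil(train_size / batch_size)  # 9 batches per epoch
full_batches = train_size // batch_size           # 8 full batches of 32
last_batch_size = train_size % batch_size         # 1 final batch of 6

print(num_batches, full_batches, last_batch_size)  # 9 8 6
```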

I meant: we had 262 images for training (the train_dataset size), but we wanted to train the model on a larger dataset, which is why we tried the data augmentation strategy. So I want to know the new number of training data points after the data augmentation has been applied.

Image augmentation is done on the fly, just before a batch of data is used for training. The augmentation is done by applying the random transformations you configure.

So, to answer your question, the effective size of the dataset is increased by applying these small random changes to the images across batches and epochs. Although you’re still training over the same 9 batches of data per epoch, the images themselves change over time.
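
Here is a minimal pure-Python analogy of what "on the fly" means; it is not the real Keras pipeline (which uses layers like RandomFlip inside a tf.data pipeline), just a toy illustration that the same underlying images are reused every epoch while the augmented copies differ:

```python
import random

# Toy analogy of on-the-fly augmentation (not the actual Keras pipeline):
# the same underlying "images" are yielded every epoch, but each time a
# batch is drawn, a fresh random transformation is applied.
images = list(range(262))  # stand-ins for the 262 training images
batch_size = 32

def augment(img, rng):
    # Stand-in for RandomFlip / RandomRotation etc.: attach a random tweak.
    return (img, rng.random())

def epoch_batches(rng):
    for start in range(0, len(images), batch_size):
        yield [augment(img, rng) for img in images[start:start + batch_size]]

rng = random.Random(0)
epoch1 = [img for batch in epoch_batches(rng) for img in batch]
epoch2 = [img for batch in epoch_batches(rng) for img in batch]

# Same 262 underlying images each epoch...
assert [i for i, _ in epoch1] == [i for i, _ in epoch2] == images
# ...but the augmented versions differ between epochs.
assert epoch1 != epoch2
```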

So, at the end, how to know how much data the model has been trained on (including both original and augmented data points)?

All you know is that for each epoch, your model will train on 262 images. The exact number of unique images across all epochs is unknown, since random changes are being introduced.
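
Put as arithmetic: the total count of (possibly augmented) samples the model trains on is just epochs times 262, while the number of distinct pixel-level variants cannot be known in advance. The epoch count of 10 below is a made-up example:

```python
images_per_epoch = 262  # train_dataset size from this thread
epochs = 10             # hypothetical training run

# Total augmented samples seen over the run; how many of them are
# pixel-level unique depends on the random transformations.
total_samples_seen = epochs * images_per_epoch
print(total_samples_seen)  # 2620
```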

Thank you, @balaji.ambresh !

You’re welcome @Bertrand_T_Tameza