Data augmentation using tf.data

I’m unable to clearly understand whether data augmentation “increases” the total number of images that are fed into the network for training, or simply transforms each image and then passes the transformed images on to the network. For example, I have a batch of 64 images, and I apply tf.image.random_brightness as my augmenter function. Does it increase the total number of images to 128 (original images + transformed images) and then pass them on to the network? Or does it simply apply random_brightness to each image and pass this transformed batch of 64 images on to the network?

Hi @Harshit1097 ,

Please follow this link to the reference page for that function.

Hi @Harshit1097 ,

Data augmentation is a tool to increase your dataset. Deep learning algorithms need a lot of data to be trained well. Sometimes we don’t have a lot of data, so one way to add more is to augment what we have. In the case of images, if I have, say, 100 pictures of cats, I can turn that dataset into 500 pictures of cats by rotating them to the left, rotating them to the right, flipping them, and tilting them. Those 4 augmentation types will produce 400 more images, and now I have 500 images to feed my model (see the sketch below for one way to do something like this with tf.image).
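As a rough illustration of that idea (not from the original post): assuming the cat pictures are stacked in a float tensor of shape (N, H, W, C) with square images, something along these lines would turn 100 images into 500:

import tensorflow as tf

def expand_with_fixed_transforms(images):
  # images: float tensor of shape (N, H, W, C); square images assumed so rotations keep the shape
  rotated_left  = tf.image.rot90(images, k=1)
  rotated_right = tf.image.rot90(images, k=3)
  flipped_lr    = tf.image.flip_left_right(images)
  flipped_ud    = tf.image.flip_up_down(images)
  # original + 4 transformed copies -> 5x as many images
  return tf.concat([images, rotated_left, rotated_right, flipped_lr, flipped_ud], axis=0)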

Does it make sense?

Juan

Thanks Juan. The confusion I have is about the use of tf.data. I am aware that using ImageDataGenerator for data augmentation actually increases the number of images by applying the transformations we specify; however, I’m unable to find out whether tf.data does the same thing.

Hello @Harshit1097,

If we look at this example in the doc page that Kic has shared,

[screenshot: the tf.image.random_brightness example from the TensorFlow docs, applied to a tensor of shape (2, 2, 3)]

The output has the same shape as the input (2, 2, 3), so it does NOT produce more samples in the output than there are in the input. However, your model DOES see more different images.
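For instance, a quick check along these lines (the pixel values here are made up) shows that the shape is preserved:

import tensorflow as tf

x = tf.constant([[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
                 [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]])  # shape (2, 2, 3), as in the docs example
y = tf.image.random_brightness(x, max_delta=0.2)
print(x.shape, y.shape)  # both (2, 2, 3): the pixel values change, not the number of images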

For example, let’s say you have only 10 raw images, then you might use tf.image.random_brightness 100 times to generate 1000 different images for an epoch of training. Do you know how to do it with tf.data?
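As a plain illustration of that counting, before getting to the tf.data version discussed below (raw_images here is a hypothetical list of 10 image tensors):

augmented = []
for _ in range(100):          # 100 passes over the same 10 raw images
  for img in raw_images:
    augmented.append(tf.image.random_brightness(img, max_delta=0.2))
# len(augmented) == 1000, each with an independently drawn brightness shift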

Raymond


Thanks @rmwkwok. I understood this now.

You are welcome @Harshit1097!

Raymond

Sorry to bring this up again, but I need further clarity. To apply tf.image.random_brightness 100 times to the 10 raw images that I have, I thought the .repeat() function of tf.data.Dataset would be used, but that doesn’t seem to work. Can you please guide me through how I can apply my augment function again and again to the raw images?

Hello @Harshit1097 , I would like to do it the other way around.

import numpy as np

X = np.random.rand(10, 28, 28, 1)

Given the above X, which is 10 samples of one-channel 28x28 images, how did you use repeat and the other methods to try to generate more samples, even though it didn’t work?

Please feel free to change X to suit your case.

Cheers,
Raymond

I’ve done something like this: create a dataset object by calling from_tensor_slices on X, then map the brightness_augment function over it, which applies random brightness to the images. Then I shuffle the images and create batches. Then I call repeat(), which repeats the dataset REPEAT times. Finally, prefetching.

def brightness_augment(image, labels):
  # Scale pixel values to [0, 1], apply a random brightness shift of up to 0.1, then clip back into range
  image = tf.image.random_brightness(image / 255, 0.1)
  image = tf.clip_by_value(image, clip_value_min=0, clip_value_max=1)
  return (image, labels)

def create_dataset(X, labels, REPEAT, is_training):
  # AUTOTUNE, SHUFFLE_BUFFER_SIZE and BATCH_SIZE are constants defined elsewhere
  dataset = tf.data.Dataset.from_tensor_slices((X, labels))
  if is_training:
    dataset = dataset.map(brightness_augment, num_parallel_calls=AUTOTUNE)
    dataset = dataset.shuffle(buffer_size=SHUFFLE_BUFFER_SIZE)
  dataset = dataset.batch(BATCH_SIZE)
  if is_training:
    dataset = dataset.repeat(REPEAT)
  dataset = dataset.prefetch(buffer_size=AUTOTUNE)

  return dataset

You only said it doesn’t work, so what exactly is the problem? Is it that the final dataset doesn’t have enough samples, or that some of the samples are repeated?


The first issue is that the AUC on the validation set has dropped to roughly a quarter of what it was when I didn’t do any data augmentation, which is counter-intuitive, since data augmentation should give similar performance if not better.
Secondly, I can’t make the numbers work out in model.fit(). My training dataset has 10,000 images. I used a batch size of 32 and REPEAT = 2, and set steps_per_epoch = REPEAT * 10,000 // BATCH_SIZE.
So while training there should be a total of 625 batches in one epoch, but the output shows a total of 3123 batches.

Hello @Harshit1097,

I won’t talk about model.fit with you now, and I won’t talk about model performance with you now.

The only thing I am interested here is how you generated the data. Let me ask you a few questions:

  1. How many samples do you have before augmentation?
  2. If you run a loop over the augmented dataset, take the number of samples out of each batch, accumulate that number, and at the end, how many samples do you have in total? (See the sketch after this list.)
  3. You use repeat after random_brightness. What difference do you expect if you swap their order?
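For question 2, a minimal counting loop could look like this, assuming dataset is the batched dataset returned by the create_dataset above:

total = 0
for images, labels in dataset:   # iterate over the batched (and repeated) dataset
  total += images.shape[0]       # add the number of samples in this batch
print(total)                     # should equal REPEAT * the number of raw samples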

Hello @rmwkwok.

  1. I had 10,000 samples before augmentation.
  2. By running through the augmented dataset (REPEAT = 2), I found that there were a total of 625 batches (i.e. 20,000 samples, since the batch size is 32), which is the expected number.
  3. I believe if I apply random_brightness after repeat, then all of the repeated images (20,000) will go through independent random brightness changes, which is exactly what I wished to have. I think this is the solution I was looking for :sweat_smile: (see the sketch below).
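For reference, one way that reordering could look, reusing the brightness_augment, AUTOTUNE, SHUFFLE_BUFFER_SIZE and BATCH_SIZE names from the earlier snippet (a sketch, not the exact code used in this thread):

def create_dataset(X, labels, REPEAT, is_training):
  dataset = tf.data.Dataset.from_tensor_slices((X, labels))
  if is_training:
    dataset = dataset.repeat(REPEAT)  # duplicate the raw samples first
    # each duplicated sample now gets its own independently drawn brightness shift
    dataset = dataset.map(brightness_augment, num_parallel_calls=AUTOTUNE)
    dataset = dataset.shuffle(buffer_size=SHUFFLE_BUFFER_SIZE)
  dataset = dataset.batch(BATCH_SIZE)
  dataset = dataset.prefetch(buffer_size=AUTOTUNE)
  return dataset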

@Harshit1097

Ok. From your answer to question 2, it seems you have rethought things and reworked your code to get the expected results. Great work!

I had to focus on the augmentation part so that we could confirm it is delivering the expected results. Otherwise, it doesn’t make sense to move on to anything that is based on it.

Even though sometimes we can’t hold back our eagerness to immediately see some training results, no matter whether those results turn out good or not, it is always good practice to carefully and closely examine each small section of code and check that it is delivering what you want it to. It is also worth considering training for only one epoch with fewer steps, just to see whether something is wrong.

Cheers,
Raymond

Thanks @rmwkwok. I understand your point that we need to look at the small things first before jumping to the results.