TensorFlow OOM Error

Hi, I’ve been working through the Image Classification and Object Localisation lab. It all ran fine on Google Colab, but I’ve also been trying to implement it on my own computer, running the code on my GPU (an Nvidia GTX 970).

I’m consistently encountering an OOM error on the very last step of the first epoch, when TensorFlow attempts to make a 3.18GB memory allocation that exceeds my GPU’s memory. This is the extract from the stack trace concerning the Tensor allocation that causes the error:

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10000,16,73,73] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

The size of the tensor suggests to me (from looking up the info on MNIST) that the error occurs when TensorFlow tries to do something with the full 10,000-image test dataset. This also seems to be supported by the fact that (None, 73, 73, 16) matches the output shape of the first Conv2D layer, and that changing the batch size for the training data has no effect on this allocation or the resulting failure.
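As a sanity check on that reading, the reported tensor shape accounts exactly for the 3.18 GB allocation in the stack trace, assuming float32:

```python
# A float32 tensor of shape [10000, 16, 73, 73] from the error message:
num_elements = 10000 * 16 * 73 * 73
size_gib = num_elements * 4 / 2**30  # 4 bytes per float32 element
print(round(size_gib, 2))  # → 3.18
```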

I’m assuming the error arises when the model tries to check the validation accuracy and bounding-box MSE? Obviously, I can still carry on with the course using Google Colab, but I’m interested to know if anyone can suggest how I might overcome this issue if I were ever to encounter it in my own work. Should I simply shrink the validation dataset to smaller and smaller subsets until it works? Or is there a way to get TensorFlow to batch the forward pass over the test data so it fits within my GPU’s memory?
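For context, the kind of batched evaluation I have in mind would be something like the sketch below (the model and data here are stand-ins, not the lab’s actual network; the input shape is just chosen so a 3×3 Conv2D produces a 73×73×16 output like the one in the error):

```python
import numpy as np
import tensorflow as tf

# Placeholder model: a 3x3 conv on 75x75 input gives a 73x73x16 feature map,
# matching the shape seen in the OOM error.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(75, 75, 1)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Placeholder test set (random data standing in for MNIST-style images)
x_test = np.random.rand(1000, 75, 75, 1).astype("float32")
y_test = np.random.randint(0, 10, size=(1000,))

# batch_size here controls the evaluation forward pass, so only 32 images'
# worth of activations need to be resident on the GPU at any one time,
# rather than the whole test set at once.
loss, acc = model.evaluate(x_test, y_test, batch_size=32, verbose=0)
```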


Hi @Hapero!
I think this is an issue with the train_step function; when I reproduce the lab on my personal computer, it raises the same OOM error for me too. To solve it, try decorating the train_step function like this:

# Decorate with @tf.function for faster training in graph mode and to avoid OOM errors
@tf.function(experimental_relax_shapes=True)
def train_step(self, image_list, gt_boxes, gt_classes, optimizer, fine_tune_variables):

Enabling experimental_relax_shapes may generate fewer graphs that are less specialised on input shapes, which helps the training loop work with bigger datasets.

If the train_step function then works correctly, try decorating whichever function you think could be responsible for the OOM, also with experimental_relax_shapes set to True.

I hope this helps; let me know if you run into any other problems. Keep it up!
