Hi, I’ve been working through the lab for Image Classification and Object Localisation. It all ran fine on Google Colab, but I’ve also been trying to implement it on my own computer, running the code on my GPU (an Nvidia GTX 970).
I’m consistently encountering an OOM error on the very last step of the first epoch, when TensorFlow attempts a 3.18 GB allocation that exceeds my GPU’s memory. This is the extract from the stack trace describing the tensor allocation that causes the error:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10000,16,73,73] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
The size of the tensor suggests to me (from looking up the details of MNIST) that the error occurs when TensorFlow tries to do something with the full 10,000-image test dataset. This also seems to be supported by the fact that (None, 73, 73, 16) matches the output shape of the first Conv2D layer, and that changing the batch size for the training data has no effect on this allocation or the resulting failure.
I’m assuming the error arises when the model tries to compute the validation accuracy and bounding-box MSE? Obviously, I can still carry on with the course using Google Colab, but I’m interested to know if anyone can suggest how I might overcome this issue if I were ever to encounter it in my own work. Should I simply shrink the validation dataset to smaller and smaller subsets until it works? Or is there a way to get TensorFlow to batch the forward pass over the test data so it fits within my GPU’s memory constraints?
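To make the second option concrete, here is the kind of thing I mean. This is only a sketch with a toy stand-in model and random data (the names `x_train`, `x_val`, etc. are mine, not from the course code), but it shows two ways I imagine the validation forward pass could be batched: Keras's `validation_batch_size` argument to `fit` (available since TF 2.2), or pre-batching the validation set with `tf.data`.

```python
import numpy as np
import tensorflow as tf

# Toy stand-in for the lab's conv net and MNIST data; shapes and names
# here are illustrative only, not taken from the course code.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

x_train = np.random.rand(64, 4).astype("float32")
y_train = np.random.rand(64, 1).astype("float32")
x_val = np.random.rand(32, 4).astype("float32")
y_val = np.random.rand(32, 1).astype("float32")

# Option 1: validation_batch_size tells Keras to run the validation
# forward pass in small batches rather than one big allocation.
history = model.fit(
    x_train, y_train,
    batch_size=16,
    epochs=1,
    validation_data=(x_val, y_val),
    validation_batch_size=8,
    verbose=0,
)

# Option 2: pre-batch the validation set as a tf.data.Dataset; Keras
# then evaluates it batch by batch automatically.
val_ds = tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(8)
history2 = model.fit(
    x_train, y_train,
    batch_size=16,
    epochs=1,
    validation_data=val_ds,
    verbose=0,
)
```

Is either of these the idiomatic way to handle it, or is there a better approach for memory-constrained GPUs?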