C4 W3 Assignment 2 - Help! CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory

I keep getting the error: CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory

It happens at Step 2.1 - Split Your Dataset into Unmasked and Masked Images.

I have not changed any code yet, but it keeps failing at this step. Please help!


InternalError                             Traceback (most recent call last)
<ipython-input> in <module>
----> 1 image_list_ds = tf.data.Dataset.list_files(image_list, shuffle=False)
      2 mask_list_ds = tf.data.Dataset.list_files(mask_list, shuffle=False)
      3
      4 for path in zip(image_list_ds.take(3), mask_list_ds.take(3)):
      5     print(path)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/dataset_ops.py in list_files(file_pattern, shuffle, seed)
   1101       shuffle = True
   1102     file_pattern = ops.convert_to_tensor(
-> 1103         file_pattern, dtype=dtypes.string, name="file_pattern")
   1104     matching_files = gen_io_ops.matching_files(file_pattern)
   1105

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, dtype_hint, ctx, accepted_result_types)
   1497
   1498   if ret is None:
-> 1499     ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
   1500
   1501   if ret is NotImplemented:

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)
    336                                          as_ref=False):
    337   _ = as_ref
--> 338   return constant(v, dtype=dtype, name=name)
    339
    340

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name)
    262   """
    263   return _constant_impl(value, dtype, shape, name, verify_shape=False,
--> 264                         allow_broadcast=True)
    265
    266

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
    273     with trace.Trace("tf.constant"):
    274       return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
--> 275   return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    276
    277   g = ops.get_default_graph()

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py in _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    298 def _constant_eager_impl(ctx, value, dtype, shape, verify_shape):
    299   """Implementation of eager constant."""
--> 300   t = convert_to_eager_tensor(value, ctx, dtype)
    301   if shape is None:
    302     return t

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py in convert_to_eager_tensor(value, ctx, dtype)
     95   except AttributeError:
     96     dtype = dtypes.as_dtype(dtype).as_datatype_enum
---> 97   ctx.ensure_initialized()
     98   return ops.EagerTensor(value, ctx.device_name, dtype)
     99

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py in ensure_initialized(self)
    537       if self._use_tfrt is not None:
    538         pywrap_tfe.TFE_ContextOptionsSetTfrt(opts, self._use_tfrt)
--> 539       context_handle = pywrap_tfe.TFE_NewContext(opts)
    540     finally:
    541       pywrap_tfe.TFE_DeleteContextOptions(opts)

InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory

I was able to resolve the issue by getting the latest version and rebooting.

Help → Get Latest Version
Help → Reboot

That error has happened a lot lately. The current understanding is that it is a resource utilization problem in the backend infrastructure: too many active users for the number of GPUs available. So it may simply be that waiting a few minutes to get the latest version put you at a point where the load on the servers was lower.
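
One clue from the traceback above: the error is raised inside ctx.ensure_initialized() / TFE_NewContext, i.e. at the moment TensorFlow first creates its GPU context, not in the dataset code itself. Any trivial eager op would hit the same failure while the GPU is exhausted; here is a minimal sketch of that check (not part of the assignment code):

import tensorflow as tf

# List the GPUs that TensorFlow can see.
print(tf.config.list_physical_devices('GPU'))

# Creating any eager tensor forces TensorFlow to initialize its GPU context,
# which is exactly the step that fails with "out of memory" in the traceback.
x = tf.constant(1.0)
print(x)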

Have you checked with the NVIDIA System Management Interface (nvidia-smi)? It reports the current status of the GPU devices, including memory usage.

Here is an example.

You can use this command-line interface from a cell in a Jupyter notebook by adding "!" in front of the nvidia-smi command. If "Memory-Usage" shows that someone is already using your resources, one way to reclaim the GPU is to kill the remaining processes, which can be slightly risky. Or, of course, just wait and get another instance that does not have any leftover processes.
In any case, knowing the GPU memory usage may help you decide on the next step if you get the same error.
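
For example, you could run the following in a notebook cell (<PID> below is just a placeholder for whatever process id the nvidia-smi table reports):

# In a Jupyter notebook cell, the "!" prefix runs a shell command on the server.
!nvidia-smi

# If the Processes table lists a leftover process holding GPU memory, it can be
# stopped by its PID -- use with care, since this kills that job outright.
# !kill -9 <PID>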


Very insightful, thank you.

Thanks for posting this command … I'm running into this error, and as far as I can see there are no active processes on the GPU and very little memory in use, yet it still complains about being out of memory. If anyone has any suggestions on how to unblock myself, that would be much appreciated.

Tue May 24 19:48:45 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.39.01    Driver Version: 510.39.01    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:18.0 Off |                    1 |
| N/A   38C    P0    42W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Thanks!

Help → Reboot Server fixes this. I found another thread … sorry for the noise.

Nidhi