CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory

I know, many topics like these have been created already, and I apologize in advance.

However, it’s been almost two hours and I’m still getting the error, so I’m unable to complete the last programming exercise of the last week of the course.

The solutions in other topics involve menu options that seem to have been removed:

  • Help → Get Latest Version | Help → Reboot
  • Help → Reboot Server

Can someone indicate where these options can be found, or otherwise help me get past this error? I would really like not to have to wait any longer, considering it’s been almost two hours.

Thank you.

Joaco.

Here’s the complete log just in case it’s helpful:

---------------------------------------------------------------------------
InternalError                             Traceback (most recent call last)
<ipython-input-3-c78346570a28> in <module>
      4 vgg = tf.keras.applications.VGG19(include_top=False,
      5                                   input_shape=(img_size, img_size, 3),
----> 6                                   weights='pretrained-model/vgg19_weights_tf_dim_ordering_tf_kernels_notop.h5')
      7 
      8 vgg.trainable = False

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/applications/vgg19.py in VGG19(include_top, weights, input_tensor, input_shape, pooling, classes, classifier_activation)
    142   x = layers.Conv2D(
    143       64, (3, 3), activation='relu', padding='same', name='block1_conv1')(
--> 144           img_input)
    145   x = layers.Conv2D(
    146       64, (3, 3), activation='relu', padding='same', name='block1_conv2')(x)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py in __call__(self, *args, **kwargs)
    924     if _in_functional_construction_mode(self, inputs, args, kwargs, input_list):
    925       return self._functional_construction_call(inputs, args, kwargs,
--> 926                                                 input_list)
    927 
    928     # Maintains info about the `Layer.call` stack.

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py in _functional_construction_call(self, inputs, args, kwargs, input_list)
   1096         # Build layer if applicable (if the `build` method has been
   1097         # overridden).
-> 1098         self._maybe_build(inputs)
   1099         cast_inputs = self._maybe_cast_inputs(inputs, input_list)
   1100 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py in _maybe_build(self, inputs)
   2641         # operations.
   2642         with tf_utils.maybe_init_scope(self):
-> 2643           self.build(input_shapes)  # pylint:disable=not-callable
   2644       # We must set also ensure that the layer is marked as built, and the build
   2645       # shape is stored since user defined build functions may not be calling

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/convolutional.py in build(self, input_shape)
    202         constraint=self.kernel_constraint,
    203         trainable=True,
--> 204         dtype=self.dtype)
    205     if self.use_bias:
    206       self.bias = self.add_weight(

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py in add_weight(self, name, shape, dtype, initializer, regularizer, trainable, constraint, partitioner, use_resource, synchronization, aggregation, **kwargs)
    612         synchronization=synchronization,
    613         aggregation=aggregation,
--> 614         caching_device=caching_device)
    615     if regularizer is not None:
    616       # TODO(fchollet): in the future, this should be handled at the

/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/tracking/base.py in _add_variable_with_custom_getter(self, name, shape, dtype, initializer, getter, overwrite, **kwargs_for_getter)
    748         dtype=dtype,
    749         initializer=initializer,
--> 750         **kwargs_for_getter)
    751 
    752     # If we set an initializer and the variable processed it, tracking will not

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer_utils.py in make_variable(name, shape, dtype, initializer, trainable, caching_device, validate_shape, constraint, use_resource, collections, synchronization, aggregation, partitioner)
    143       synchronization=synchronization,
    144       aggregation=aggregation,
--> 145       shape=variable_shape if variable_shape else None)
    146 
    147 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py in __call__(cls, *args, **kwargs)
    258   def __call__(cls, *args, **kwargs):
    259     if cls is VariableV1:
--> 260       return cls._variable_v1_call(*args, **kwargs)
    261     elif cls is Variable:
    262       return cls._variable_v2_call(*args, **kwargs)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py in _variable_v1_call(cls, initial_value, trainable, collections, validate_shape, caching_device, name, variable_def, dtype, expected_shape, import_scope, constraint, use_resource, synchronization, aggregation, shape)
    219         synchronization=synchronization,
    220         aggregation=aggregation,
--> 221         shape=shape)
    222 
    223   def _variable_v2_call(cls,

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py in <lambda>(**kwargs)
    197                         shape=None):
    198     """Call on Variable class. Useful to force the signature."""
--> 199     previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
    200     for _, getter in ops.get_default_graph()._variable_creator_stack:  # pylint: disable=protected-access
    201       previous_getter = _make_getter(getter, previous_getter)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py in default_variable_creator(next_creator, **kwargs)
   2595         synchronization=synchronization,
   2596         aggregation=aggregation,
-> 2597         shape=shape)
   2598   else:
   2599     return variables.RefVariable(

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py in __call__(cls, *args, **kwargs)
    262       return cls._variable_v2_call(*args, **kwargs)
    263     else:
--> 264       return super(VariableMetaclass, cls).__call__(*args, **kwargs)
    265 
    266 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py in __init__(self, initial_value, trainable, collections, validate_shape, caching_device, name, dtype, variable_def, import_scope, constraint, distribute_strategy, synchronization, aggregation, shape)
   1516           aggregation=aggregation,
   1517           shape=shape,
-> 1518           distribute_strategy=distribute_strategy)
   1519 
   1520   def _init_from_args(self,

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py in _init_from_args(self, initial_value, trainable, collections, caching_device, name, dtype, constraint, synchronization, aggregation, distribute_strategy, shape)
   1649           with ops.name_scope("Initializer"), device_context_manager(None):
   1650             initial_value = ops.convert_to_tensor(
-> 1651                 initial_value() if init_from_fn else initial_value,
   1652                 name="initial_value", dtype=dtype)
   1653           if shape is not None:

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/initializers/initializers_v2.py in __call__(self, shape, dtype)
    395        (via `tf.keras.backend.set_floatx(float_dtype)`)
    396     """
--> 397     return super(VarianceScaling, self).__call__(shape, dtype=_get_dtype(dtype))
    398 
    399 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/init_ops_v2.py in __call__(self, shape, dtype)
    559     else:
    560       limit = math.sqrt(3.0 * scale)
--> 561       return self._random_generator.random_uniform(shape, -limit, limit, dtype)
    562 
    563   def get_config(self):

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/init_ops_v2.py in random_uniform(self, shape, minval, maxval, dtype)
   1042       op = random_ops.random_uniform
   1043     return op(
-> 1044         shape=shape, minval=minval, maxval=maxval, dtype=dtype, seed=self.seed)
   1045 
   1046   def truncated_normal(self, shape, mean, stddev, dtype):

/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py in wrapper(*args, **kwargs)
    199     """Call target, and fall back on dispatchers if there is a TypeError."""
    200     try:
--> 201       return target(*args, **kwargs)
    202     except (TypeError, ValueError):
    203       # Note: convert_to_eager_tensor currently raises a ValueError, not a

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/random_ops.py in random_uniform(shape, minval, maxval, dtype, seed, name)
    286     maxval = 1
    287   with ops.name_scope(name, "random_uniform", [shape, minval, maxval]) as name:
--> 288     shape = tensor_util.shape_tensor(shape)
    289     # In case of [0,1) floating results, minval and maxval is unused. We do an
    290     # `is` comparison here since this is cheaper than isinstance or  __eq__.

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/tensor_util.py in shape_tensor(shape)
   1027       # not convertible to Tensors because of mixed content.
   1028       shape = tuple(map(tensor_shape.dimension_value, shape))
-> 1029   return ops.convert_to_tensor(shape, dtype=dtype, name="shape")
   1030 
   1031 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, dtype_hint, ctx, accepted_result_types)
   1497 
   1498     if ret is None:
-> 1499       ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
   1500 
   1501     if ret is NotImplemented:

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)
    336                                          as_ref=False):
    337   _ = as_ref
--> 338   return constant(v, dtype=dtype, name=name)
    339 
    340 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name)
    262   """
    263   return _constant_impl(value, dtype, shape, name, verify_shape=False,
--> 264                         allow_broadcast=True)
    265 
    266 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
    273       with trace.Trace("tf.constant"):
    274         return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
--> 275     return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    276 
    277   g = ops.get_default_graph()

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py in _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    298 def _constant_eager_impl(ctx, value, dtype, shape, verify_shape):
    299   """Implementation of eager constant."""
--> 300   t = convert_to_eager_tensor(value, ctx, dtype)
    301   if shape is None:
    302     return t

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py in convert_to_eager_tensor(value, ctx, dtype)
     95     except AttributeError:
     96       dtype = dtypes.as_dtype(dtype).as_datatype_enum
---> 97   ctx.ensure_initialized()
     98   return ops.EagerTensor(value, ctx.device_name, dtype)
     99 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py in ensure_initialized(self)
    537         if self._use_tfrt is not None:
    538           pywrap_tfe.TFE_ContextOptionsSetTfrt(opts, self._use_tfrt)
--> 539         context_handle = pywrap_tfe.TFE_NewContext(opts)
    540       finally:
    541         pywrap_tfe.TFE_DeleteContextOptions(opts)

Try restarting the server.

It is already working. It’s unfortunate that I had to wait for 2+ hours.

Can you share some instructions on how to restart the server next time this happens? I’ve tried “Kernel → Restart & Clear Output” earlier today to no avail.

This is an issue that Coursera is trying to address.
Somewhere in the Notebook menu there is a restart server option.

There isn’t, that’s why I’m asking, but thanks anyway.

This is a hardware error. There is at least one broken GPU in the Coursera environment.
We have reported this and are waiting for them to act.

So just restarting software, such as the kernel, does not work. We need the infrastructure to re-allocate a GPU for us. If you stay in the notebook, you remain on the same hardware. So you need to shut everything down, close the browser, and start again from the beginning.

Some exercises include an instruction like the following. I think this is what Tom is referring to.

In this ungraded lab, on the top right you’ll see a button, “Save and close”. Please press it every time when you are no longer running or closing the lab. This helps to release the GPU resources used by your version of the lab, so that other learners can benefit from it as much as you did.

Unfortunately, this option is not provided for this assignment. I think the most similar operation is to select “Close and Halt” from the “File” menu. If you want to save the notebook, select “Save and Checkpoint” beforehand. Then close the browser and re-open it.

Depending on the workload at that time, there may not be another free GPU, so the same GPU may be assigned again. In that case, the only option is to wait.

Sorry for the inconvenience. We have already reported this hardware issue. Until the Coursera platform team fixes it, the only workaround is the one above.
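
After re-opening the notebook, it can help to check whether the GPU initializes at all before running every cell. Below is a minimal sketch of such a check. The helper name `gpu_probe` and the probe snippet are hypothetical and not part of the course notebook. It assumes that creating a small tensor triggers the same CUDA initialization that fails in the traceback above (which ends in `ctx.ensure_initialized()`), and it runs the check in a child process so the notebook kernel is not affected. It sticks to `subprocess` arguments available in Python 3.6, the version shown in the traceback.

```python
import subprocess
import sys

# Assumption: creating one small tensor is enough to trigger CUDA initialization.
PROBE_CODE = "import tensorflow as tf; tf.constant(1.0)"


def gpu_probe(command=None, timeout_s=120):
    """Run a small command in a child process.

    Returns (ok, stderr_tail): ok is True when the child exits with code 0,
    and stderr_tail holds the last 500 characters of its error output.
    """
    if command is None:
        command = [sys.executable, "-c", PROBE_CODE]
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,  # decode output as text (Python 3.6 compatible)
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "probe timed out"
    return result.returncode == 0, result.stderr[-500:]
```

Calling `gpu_probe()` with no arguments in the first cell of a fresh session returns `(False, ...)` if TensorFlow cannot initialize, and the second value shows the end of the error message. Note that a `True` result only means initialization succeeded; it does not prove a GPU was used. Also, TensorFlow reserves most GPU memory by default, so if the notebook kernel has already used the GPU, the child process may fail for that reason alone.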

In the “Convolutional Neural Networks” course, as far as I know, there are three assignments that use a GPU.

Week 2 : Residual Networks
Week 3 : Image Segmentation with U-Net
Week 4 : Art Generation with Neural Style Transfer

So, if you are working on any other assignment, you are absolutely safe. :slight_smile:

Thanks @anon57530071 , I am on the road this week and do not have access to the course notebooks.

I’m facing the same issue … what do I have to wait for?

It’s a hardware issue. You can try to get another GPU by resetting the server connection, but it all depends on availability.
Recently, “Reboot Server” seems to have been added to Lab Help.

So you can try this.

If you are familiar with NVIDIA tools, you can check the GPU status with nvidia-smi.

If it reports an uncorrectable error (marked with a red dotted circle), the GPU is most likely broken.
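
The same check can be scripted. This is only a sketch under assumptions: the function names are hypothetical, `nvidia-smi` must be available on the notebook server, and the query field `ecc.errors.uncorrected.volatile.total` may differ between driver versions (`nvidia-smi --help-query-gpu` lists the supported fields). GPUs without ECC support report a non-numeric value, which is treated as unknown here.

```python
import subprocess

# Field names as listed by `nvidia-smi --help-query-gpu`; may vary by driver version.
QUERY = "index,name,ecc.errors.uncorrected.volatile.total"


def parse_ecc_csv(csv_text):
    """Map GPU index -> uncorrectable error count (None if not reported)."""
    counts = {}
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) < 3:
            continue  # skip lines that do not look like a data row
        index, raw_count = fields[0], fields[-1]
        counts[index] = int(raw_count) if raw_count.isdigit() else None
    return counts


def suspicious_gpus(counts):
    """Return the indices of GPUs that report at least one uncorrectable error."""
    return sorted(index for index, count in counts.items() if count)


def query_ecc():
    """Run nvidia-smi and parse its CSV output; raises if the tool fails."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + QUERY, "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return parse_ecc_csv(result.stdout)
```

If `suspicious_gpus(query_ecc())` returns a non-empty list, the session is probably still on a faulty GPU, and closing the session to request a different one is the thing to try.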

Of course, you may not be able to get a new GPU, depending on the current resource allocation. In that case, you may want to work on other topics for a while and then come back.

There are only three assignments in Course 4 that use a GPU.
They are as follows:

Week 2 : Residual Networks
Week 3 : Programming Assignment: Image Segmentation with U-Net
Week 4 : Art Generation with Neural Style Transfer

The others should be safe, since they do not use a GPU.