C3W3 - CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

Hi everybody,

I got the following error in the “Distributed Strategies with TF and Keras” lab:

failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

I tried turning the Colab GPU on and off and changing the os.environ['CUDA_VISIBLE_DEVICES'] setting. No success; the error is the same.

Has anyone run into this? Thanks in advance.
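In case it helps with diagnosis, a quick check like this (not part of the lab code) shows what the runtime actually exposes; an empty GPU list would match the cuInit message:

import os
import tensorflow as tf

# An empty list means TensorFlow sees no CUDA device, which is exactly what
# produces the "failed call to cuInit: CUDA_ERROR_NO_DEVICE" message.
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))

# "-1" hides every GPU from CUDA; on a CPU-only runtime the variable may simply be unset.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))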


Hi Thiago,

It might be that you have inadvertently modified a line of code in the notebook that sets the device. Here is a thread that discusses a very similar problem; you can check it out and see whether it resolves your issue.

Let me know if the issue persists.


I also encountered this error. I then reran the whole notebook with the runtime set to GPU, without modifying any code this time, and the error remains. Interesting.

Thanks, Somesh. I reran the code recently and it worked fine.

Hi, I tried to rerun the code today. Still the same error.

Same error here. Any guidance, please?

I’ve run the code, I’m sure I haven’t changed anything, and I get the same error. I’ve tried several times and it happens every time. I eventually gave up, marked the lab complete, and moved on.

For what it’s worth, this output exactly matches the output of the “Multi-worker training with Keras” Colab from TensorFlow > Learn > TensorFlow Core > Tutorials.

I get the same error. Could it be because we disable the GPU with os.environ["CUDA_VISIBLE_DEVICES"] = "-1"?

Also got this error (I haven’t changed anything).

I ran C3_W3_Lab_1_Distributed_Training.ipynb and at the cell

%%bash
cat job_0.log

I got the error:
tensorflow.python.framework.errors_impl.UnknownError: Could not start gRPC server

I’ve changed the runtime type to GPU. No idea how to continue with this ungraded lab…

Leaving the runtime type set to None gives me the error:

failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

Can anyone help, please?
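In case it helps with diagnosis, the cell that fails depends on the TF_CONFIG set earlier in the notebook. The values below are copied from the public “Multi-worker training with Keras” tutorial that this lab follows, so the lab’s exact ports may differ:

import os
import json

# Cluster description from the tutorial: two workers on the same machine.
tf_config = {
    "cluster": {"worker": ["localhost:12345", "localhost:23456"]},
    "task": {"type": "worker", "index": 0},
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)

I don’t know whether a leftover worker from an earlier run holding one of those ports could explain the gRPC failure.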

I’m able to run the lab without issues on Google Colab. What’s your environment?

It’s safe to ignore this message since we’re running distributed training on a single machine:
2022-11-13 06:33:09.401194: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

Here’s the related note in the assignment:
Disable all GPUs. This prevents errors caused by the workers all trying to use the same GPU. For a real application each worker would be on a different machine.
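To make that note concrete, here is a minimal sketch of the single-machine multi-worker setup, modeled on the public tutorial rather than the lab’s exact cells (the lab’s model and file layout may differ):

import os
import tensorflow as tf

# Hide every GPU so the workers don't all try to grab the same device.
# On a single machine this is exactly what produces the harmless
# CUDA_ERROR_NO_DEVICE message.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# Each worker reads the cluster layout from the TF_CONFIG environment
# variable (see earlier in the thread) and builds the same strategy.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Placeholder model; the lab builds its own model inside the worker script.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
        metrics=["accuracy"],
    )

As for the “Could not start gRPC server” error reported above: one thing worth ruling out is a worker from a previous run still holding the port defined in TF_CONFIG; restarting the Colab runtime and rerunning the notebook from the top clears any leftover processes.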